|
[1]
|
Michael Hahsler, Sudheer Chelluboina, Kurt Hornik, and Christian Buchta.
The arules R-package ecosystem: Analyzing interesting patterns from
large transaction datasets.
Journal of Machine Learning Research, 12:1977-1981, 2011.
[ bib |
at the publisher ]
This paper describes the ecosystem of R add-on packages developed around
the infrastructure provided by the package arules. The packages
provide comprehensive functionality for analyzing interesting patterns
including frequent itemsets, association rules, frequent sequences and
for building applications like associative classification. After
discussing the ecosystem's design we illustrate the ease of mining
and visualizing rules with a short example.
|
|
[2]
|
Michael Hahsler and Sudheer Chelluboina.
Visualizing association rules in hierarchical groups.
In 42nd Symposium on the Interface: Statistical, Machine
Learning, and Visualization Algorithms (Interface 2011). The Interface
Foundation of North America, 2011.
[ bib |
.pdf ]
Association rule mining is one of the most popular data mining methods.
However, mining association rules often results in a very large number
of found rules, leaving the analyst with the task to go through all the
rules and discover interesting ones. Sifting manually through large
sets of rules is time consuming and strenuous. Visualization has a long
history of making large amounts of data better accessible using
techniques like selecting and zooming. However, most association rule
visualization techniques are still falling short when it comes to a
large number of rules. In this paper we present a new interactive
visualization technique which lets the user navigate through a
hierarchy of groups of association rules. We demonstrate how this new
visualization technique can be used to analyze large sets of
association rules with examples from our implementation in the
R-package arulesViz.
|
|
[3]
|
Michael Hahsler, Christian Buchta, and Kurt Hornik.
Selective association rule generation.
Computational Statistics, 23(2):303-315, April 2008.
[ bib |
DOI |
at the publisher |
.pdf ]
Mining association rules is a popular and well researched
method for discovering interesting relations between variables in
large databases. A practical problem is that at medium to low support
values often a large number of frequent itemsets and an even larger
number of association rules are found in a database. A widely used
approach is to gradually increase minimum support and minimum
confidence or to filter the found rules using increasingly strict
constraints on additional measures of interestingness until the set of
rules found is reduced to a manageable size. In this paper we describe
a different approach which is based on the idea of first defining a set
of “interesting” itemsets (e.g., by a mixture of mining and expert
knowledge) and then, in a second step, selectively generating rules
for only these itemsets. The main advantage of this approach over
increasing thresholds or filtering rules is that the number of rules
found is significantly reduced while at the same time it is not
necessary to increase the support and confidence thresholds which
might lead to missing important information in the database.
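
The two-step idea in this abstract (fix a set of itemsets first, then generate rules only for those) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm or the arules API: the function names, the toy transactions, and the restriction to single-item consequents are all hypothetical choices made for the example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rules_for_itemsets(itemsets, transactions):
    """Generate rules only for the given itemsets (hypothetical helper).

    For each itemset, every item in turn becomes the consequent and the
    remaining items form the antecedent. Each rule is returned as
    (antecedent, consequent, support, confidence).
    """
    rules = []
    for itemset in itemsets:
        itemset = frozenset(itemset)
        supp = support(itemset, transactions)
        for consequent in sorted(itemset):
            antecedent = itemset - {consequent}
            if not antecedent:
                continue  # single-item sets yield no rule here
            supp_lhs = support(antecedent, transactions)
            if supp_lhs == 0:
                continue  # confidence is undefined
            rules.append(
                (tuple(sorted(antecedent)), consequent, supp, supp / supp_lhs)
            )
    return rules


# Toy data (assumed for illustration only).
transactions = [
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk"}),
    frozenset({"milk"}),
]

# Step 1: the "interesting" itemsets, here simply chosen by hand.
interesting = [{"bread", "butter"}]

# Step 2: rules are generated for these itemsets only.
rules = rules_for_itemsets(interesting, transactions)
```

No support or confidence threshold appears anywhere in the sketch; the number of rules is bounded by the chosen itemsets alone.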
|
|
[4]
|
Michael Hahsler and Kurt Hornik.
Building on the arules infrastructure for analyzing transaction data
with R.
In R. Decker and H.-J. Lenz, editors, Advances in Data Analysis,
Proceedings of the 30th Annual Conference of the Gesellschaft für
Klassifikation e.V., Freie Universität Berlin, March 8-10, 2006, Studies
in Classification, Data Analysis, and Knowledge Organization, pages 449-456.
Springer-Verlag, 2007.
[ bib |
at the publisher |
.pdf ]
The free and extensible statistical computing environment R with its
enormous number of extension packages already provides many state-of-the-art
techniques for data analysis. Support for association rule mining,
a popular exploratory method which can be used, among other purposes,
for uncovering cross-selling opportunities in market baskets,
has become available recently with the R extension package arules.
After a brief introduction to transaction data and association rules,
we present the formal framework implemented in arules and demonstrate
how clustering and association rule mining can be applied together
using a market basket data set from a typical retailer. This paper
shows that implementing a basic infrastructure with formal classes
in R provides an extensible basis which can very efficiently be employed
for developing new applications (such as clustering transactions)
in addition to association rule mining.
|
|
[5]
|
Michael Hahsler and Kurt Hornik.
New probabilistic interest measures for association rules.
Intelligent Data Analysis, 11(5):437-455, 2007.
[ bib |
at the publisher |
.pdf ]
Mining association rules is an important technique for discovering
meaningful patterns in transaction databases. Many different measures
of interestingness have been proposed for association rules. However,
these measures fail to take the probabilistic properties of the mined
data into account. In this paper, we start with presenting a simple
probabilistic framework for transaction data which can be used to
simulate transaction data when no associations are present. We use
such data and a real-world database from a grocery outlet to explore
the behavior of confidence and lift, two popular interest measures
used for rule mining. The results show that confidence is systematically
influenced by the frequency of the items in the left-hand side of
rules and that lift performs poorly at filtering random noise in transaction
data. Based on the probabilistic framework we develop two new interest
measures, hyper-lift and hyper-confidence, which can be used to filter
or order mined association rules. The new measures show significantly
better performance than lift for applications where spurious rules
are problematic.
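
For readers unfamiliar with the two measures examined in this abstract, the sketch below computes support, confidence, and lift for a single rule using their standard textbook definitions. The function names and toy transactions are assumptions made for illustration. Hyper-lift and hyper-confidence are not shown because the abstract does not define them.

```python
def count(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions)


def interest_measures(lhs, rhs, transactions):
    """Support, confidence and lift of the rule lhs -> rhs.

    Assumes lhs and rhs each occur in at least one transaction.
    """
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    n = len(transactions)
    supp_lhs = count(lhs, transactions) / n
    supp_rhs = count(rhs, transactions) / n
    supp_rule = count(lhs | rhs, transactions) / n
    confidence = supp_rule / supp_lhs  # estimate of P(rhs | lhs)
    lift = confidence / supp_rhs       # 1.0 would indicate independence
    return {"support": supp_rule, "confidence": confidence, "lift": lift}


# Toy data (assumed for illustration only).
transactions = [
    frozenset({"a", "b"}),
    frozenset({"a", "b"}),
    frozenset({"a"}),
    frozenset({"b"}),
    frozenset({"c"}),
]

measures = interest_measures({"a"}, {"b"}, transactions)
```

Here the rule a -> b has support 2/5, confidence 2/3, and lift 10/9.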
|
|
[6]
|
Thomas Reutterer, Michael Hahsler, and Kurt Hornik.
Data Mining und Marketing am Beispiel der explorativen
Warenkorbanalyse.
Marketing ZFP, 29(3):165-181, 2007.
[ bib |
at the publisher ]
Data mining techniques are an increasingly important addition to the
conventional toolbox of methods in marketing research and practice.
Such primarily data-driven analysis tools are used with the aim of
“intelligently” extracting marketing-relevant information from
large databases (so-called data warehouses) and preparing it in a
suitable form for subsequent decision support. This article discusses
points of contact between data mining and marketing and demonstrates
the concrete use of selected data mining methods, using the
exploratory market basket analysis (or assortment interdependence
analysis) of a transaction data set from food retailing as an example.
The methods applied are techniques from classical affinity analysis,
a k-medoid clustering method, and tools for generating and
subsequently evaluating association rules between the product
categories contained in the assortment. The procedure is illustrated
using the extension package arules, which is freely available for the
statistical software R.
|
|
[7]
|
Michael Hahsler.
A model-based frequency constraint for mining associations from
transaction data.
Data Mining and Knowledge Discovery, 13(2):137-166, September
2006.
[ bib |
DOI |
at the publisher |
.pdf ]
Mining frequent itemsets is a popular method for finding associated
items in databases. For this method, support, the co-occurrence frequency
of the items which form an association, is used as the primary indicator
of the association's significance. A single user-specified support
threshold is used to decide if associations should be further investigated.
Support has some known problems with rare items, favors shorter itemsets
and sometimes produces misleading associations. In this paper we
develop a novel model-based frequency constraint as an alternative
to a single, user-specified minimum support. The constraint utilizes
knowledge of the process generating transaction data by applying
a simple stochastic mixture model (the NB model) which allows for
transaction data's typically highly skewed item frequency distribution.
A user-specified precision threshold is used together with the model
to find local frequency thresholds for groups of itemsets. Based
on the constraint we develop the notion of NB-frequent itemsets and
adapt a mining algorithm to find all NB-frequent itemsets in a database.
In experiments with publicly available transaction databases we show
that the new constraint provides improvements over a single minimum
support threshold and that the precision threshold is more robust
and easier to set and interpret by the user.
|
|
[8]
|
Michael Hahsler and Kurt Hornik.
New probabilistic interest measures for association rules.
Report 38, Research Report Series, Department of Statistics and
Mathematics, Wirtschaftsuniversität Wien, Augasse 2-6, 1090 Wien,
Austria, August 2006.
[ bib |
at the publisher ]
Mining association rules is an important technique for discovering
meaningful patterns in transaction databases. Many different measures
of interestingness have been proposed for association rules. However,
these measures fail to take the probabilistic properties of the mined
data into account. In this paper, we start with presenting a simple
probabilistic framework for transaction data which can be used to
simulate transaction data when no associations are present. We use
such data and a real-world database from a grocery outlet to explore
the behavior of confidence and lift, two popular interest measures
used for rule mining. The results show that confidence is systematically
influenced by the frequency of the items in the left-hand side of
rules and that lift performs poorly at filtering random noise in transaction
data. Based on the probabilistic framework we develop two new interest
measures, hyper-lift and hyper-confidence, which can be used to filter
or order mined association rules. The new measures show significantly
better performance than lift for applications where spurious rules
are problematic.
|
|
[9]
|
Michael Hahsler, Kurt Hornik, and Thomas Reutterer.
Warenkorbanalyse mit Hilfe der Statistik-Software R.
In Peter Schnedlitz, Renate Buber, Thomas Reutterer, Arnold Schuh,
and Christoph Teller, editors, Innovationen in Marketing, pages
144-163. Linde-Verlag, 2006.
[ bib |
at the publisher |
.pdf ]
Market basket analysis (or assortment interdependence analysis) refers
to a range of methods for studying the products or categories of a retail
assortment that are purchased together in a single shopping trip. This
chapter takes a closer look at exploratory market basket analysis, which
aims to condense and compactly represent the cross-purchase relationships
found in (usually very large) retail transaction data. With its enormous
number of available extension packages, the freely available statistical
software R is an ideal basis for carrying out such market basket analyses.
The infrastructure for transaction data in the extension package arules
provides a flexible basis for market basket analysis. It supports the
efficient representation, manipulation, and analysis of market basket data
together with arbitrary additional information on products (for example,
the assortment hierarchy) and on transactions (for example, revenue or
contribution margin). The package is seamlessly integrated into R and
thus allows the direct application of existing state-of-the-art methods
for sampling, clustering, and visualization of market basket data.
In addition, arules includes common algorithms for finding association rules
and the data structures needed to analyze patterns.
A selection of the most important functions is demonstrated using a real
transaction data set from food retailing.
|
|
[10]
|
Michael Hahsler, Kurt Hornik, and Thomas Reutterer.
Implications of probabilistic data modeling for mining association
rules.
In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nürnberger, and
W. Gaul, editors, From Data and Information Analysis to Knowledge
Engineering, Proceedings of the 29th Annual Conference of the Gesellschaft
für Klassifikation e.V., University of Magdeburg, March 9-11, 2005,
Studies in Classification, Data Analysis, and Knowledge Organization, pages
598-605. Springer-Verlag, 2006.
[ bib |
at the publisher |
.pdf ]
Mining association rules is an important technique for discovering
meaningful patterns in transaction databases. In the current literature,
the properties of algorithms to mine association rules are discussed
in great detail. We present a simple probabilistic framework for
transaction data which can be used to simulate transaction data when
no associations are present. We use such data and a real-world grocery
database to explore the behavior of confidence and lift, two popular
interest measures used for rule mining. The results show that confidence
is systematically influenced by the frequency of the items in the
left-hand side of rules and that lift performs poorly at filtering random
noise in transaction data. The probabilistic data modeling approach
presented in this paper not only is a valuable framework to analyze
interest measures but also provides a starting point for further
research to develop new interest measures which are based on statistical
tests and geared towards the specific properties of transaction data.
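
The idea of simulating transaction data in which no associations are present can be sketched as below. The abstract does not spell out the framework, so this example makes a simplifying assumption: each item enters each transaction independently with a fixed probability. The function names, item probabilities, and seed are hypothetical.

```python
import random


def simulate_independent_transactions(item_probs, n_transactions, seed=0):
    """Draw transactions in which every item occurs independently.

    `item_probs` maps each item to its (assumed) occurrence probability.
    """
    rng = random.Random(seed)
    return [
        frozenset(item for item, p in item_probs.items() if rng.random() < p)
        for _ in range(n_transactions)
    ]


def lift(a, b, transactions):
    """Lift of the item pair (a, b); None if either item never occurs."""
    n = len(transactions)
    n_a = sum(a in t for t in transactions)
    n_b = sum(b in t for t in transactions)
    n_ab = sum(a in t and b in t for t in transactions)
    if n_a == 0 or n_b == 0:
        return None
    return (n_ab / n) / ((n_a / n) * (n_b / n))


# Skewed item probabilities (assumed values for illustration).
item_probs = {"common": 0.5, "medium": 0.1, "rare": 0.01}
simulated = simulate_independent_transactions(item_probs, 1000, seed=42)

# Any deviation of this value from 1.0 is sampling noise by construction.
noise_lift = lift("medium", "rare", simulated)
```

Because the items are generated independently, the lift of every pair would be exactly 1 in expectation; inspecting how far observed values stray from 1, especially for rare items, gives a feel for the noise the abstract refers to.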
|
|
[11]
|
Michael Hahsler, Bettina Grün, and Kurt Hornik.
arules - A computational environment for mining association rules
and frequent item sets.
Journal of Statistical Software, 14(15):1-25, October 2005.
[ bib |
at the publisher |
.pdf ]
Mining frequent itemsets and association rules is a popular and well
researched approach for discovering interesting relationships between
variables in large databases. The R package arules presented in this
paper provides a basic infrastructure for creating and manipulating
input data sets and for analyzing the resulting itemsets and rules.
The package also includes interfaces to two fast mining algorithms,
the popular C implementations of Apriori and Eclat by Christian Borgelt.
These algorithms can be used to mine frequent itemsets, maximal frequent
itemsets, closed frequent itemsets and association rules.
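
To illustrate what mining frequent itemsets involves, here is a naive level-wise search in the style of Apriori. It is a teaching sketch only: it is neither Borgelt's C implementation nor the arules interface, and the function name and toy data are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    result = {}
    k = 1
    while candidates:
        # Count each candidate and keep those that are frequent.
        frequent = {}
        for c in candidates:
            supp = sum(c <= t for t in transactions) / n
            if supp >= min_support:
                frequent[c] = supp
        result.update(frequent)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return result


# Toy data (assumed for illustration only).
transactions = [
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk"}),
    frozenset({"milk"}),
]

found = frequent_itemsets(transactions, min_support=0.5)
```

With a minimum support of 0.5, the three single items and the pairs {bread, butter} and {bread, milk} are frequent; the three-item candidate is pruned because {butter, milk} is not.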
|
|
[12]
|
Michael Hahsler, Bettina Grün, and Kurt Hornik.
A computational environment for mining association rules and frequent
item sets.
Report 15, Research Report Series, Department of Statistics and
Mathematics, Wirtschaftsuniversität Wien, Augasse 2-6, 1090 Wien,
Austria, April 2005.
[ bib |
at the publisher ]
Mining frequent itemsets and association rules is a popular and well
researched approach to discovering interesting relationships between
variables in large databases. The R package arules presented in this
paper provides a basic infrastructure for creating and manipulating
input data sets and for analyzing the resulting itemsets and rules.
The package also includes interfaces to two fast mining algorithms,
the popular C implementations of Apriori and Eclat by Christian Borgelt.
These algorithms can be used to mine frequent itemsets, maximal frequent
itemsets, closed frequent itemsets and association rules.
|
|
[13]
|
Michael Hahsler, Kurt Hornik, and Thomas Reutterer.
Implications of probabilistic data modeling for rule mining.
Report 14, Research Report Series, Department of Statistics and
Mathematics, Wirtschaftsuniversität Wien, Augasse 2-6, 1090 Wien,
Austria, March 2005.
[ bib |
at the publisher ]
Mining association rules is an important technique for discovering
meaningful patterns in transaction databases. In the current literature,
the properties of algorithms to mine associations are discussed in
great detail. In this paper we investigate properties of transaction
data sets from a probabilistic point of view. We present a simple
probabilistic framework for transaction data and its implementation
using the R statistical computing environment. The framework can
be used to simulate transaction data when no associations are present.
We use such data to explore how well confidence and lift, two popular
interest measures used for rule mining, filter noise. Based
on the framework we develop the measure hyperlift and we compare
this new measure to lift using simulated data and a real-world grocery
database.
|
|
[14]
|
Michael Hahsler.
A model-based frequency constraint for mining associations from
transaction data.
Working Paper 07/2004, Working Papers on Information Processing and
Information Management, Institut für Informationsverarbeitung und
-wirtschaft, Wirtschaftsuniversität Wien, Augasse 2-6, 1090 Wien,
Austria, November 2004.
[ bib |
at the publisher ]
In this paper we develop an alternative to minimum support which
utilizes knowledge of the process which generates transaction data
and allows for highly skewed frequency distributions. We apply a
simple stochastic model (the NB model), which is known for its usefulness
to describe item occurrences in transaction data, to develop a frequency
constraint. This model-based frequency constraint is used together
with a precision threshold to find individual support thresholds
for groups of associations. We develop the notion of NB-frequent
itemsets and present two mining algorithms which find all NB-frequent
itemsets in a database. In experiments with publicly available transaction
databases we show that the new constraint can provide significant
improvements over a single minimum support threshold and that the
precision threshold is easier to use.
|
|
[15]
|
Andreas Geyer-Schulz and Michael Hahsler.
Comparing two recommender algorithms with the help of recommendations
by peers.
In O.R. Zaiane, J. Srivastava, M. Spiliopoulou, and B. Masand,
editors, WEBKDD 2002 - Mining Web Data for Discovering Usage Patterns
and Profiles 4th International Workshop, Edmonton, Canada, July 2002, Revised
Papers, Lecture Notes in Computer Science LNAI 2703, pages 137-158.
Springer-Verlag, 2003.
(Revised version of the WEBKDD 2002 paper “Evaluation of Recommender
Algorithms for an Internet Information Broker based on Simple Association
Rules and on the Repeat-Buying Theory”).
[ bib |
at the publisher |
.pdf ]
Since more and more Web sites, especially sites of retailers, offer
automatic recommendation services using Web usage mining, evaluation
of recommender algorithms has become increasingly important. In this
paper we present a framework for the evaluation of different aspects
of recommender systems based on the process of discovering knowledge
in databases introduced by Fayyad et al. and we summarize research
already done in this area. One aspect identified in the presented
evaluation framework is widely neglected when dealing with recommender
algorithms. This aspect is to evaluate how useful patterns extracted
by recommender algorithms are to support the social process of recommending
products to others, a process normally driven by recommendations
by peers or experts. To fill this gap for recommender algorithms
based on frequent itemsets extracted from usage data we evaluate
the usefulness of two algorithms. The first recommender algorithm
uses association rules, and the other algorithm is based on the repeat-buying
theory known from marketing research. We use 6 months of usage data
from an educational Internet information broker and compare useful
recommendations identified by users from the target group of the
broker (peers) with the recommendations produced by the algorithms.
The results of the evaluation presented in this paper suggest that
frequent itemsets from usage histories match the concept of useful
recommendations expressed by peers with satisfactory accuracy (higher
than 70%) and precision (between 60% and 90%). Also the evaluation
suggests that both algorithms studied in the paper perform similarly
on real-world data if they are tuned properly.
|
|
[16]
|
Andreas Geyer-Schulz and Michael Hahsler.
Evaluation of recommender algorithms for an internet information
broker based on simple association rules and on the repeat-buying theory.
In Brij Masand, Myra Spiliopoulou, Jaideep Srivastava, and Osmar R.
Zaiane, editors, Fourth WEBKDD Workshop: Web Mining for Usage Patterns
& User Profiles, pages 100-114, Edmonton, Canada, July 2002.
[ bib |
.pdf ]
Association rules are a widely used technique to generate recommendations
in commercial and research recommender systems. Since more and more
Web sites, especially of retailers, offer automatic recommender services
using Web usage mining, evaluation of recommender algorithms becomes
increasingly important. In this paper we first present a framework
for the evaluation of different aspects of recommender systems based
on the process of discovering knowledge in databases of Fayyad et
al. and then we focus on the comparison of the performance of two
recommender algorithms based on frequent itemsets. The first recommender
algorithm uses association rules, and the other recommender algorithm
is based on the repeat-buying theory known from marketing research.
For the evaluation we concentrated on how well the patterns extracted
from usage data match the concept of useful recommendations of users.
We use 6 months of usage data from an educational Internet information
broker and compare useful recommendations identified by users from
the target group of the broker with the results of the recommender
algorithms. The results of the evaluation presented in this paper
suggest that frequent itemsets from purchase histories match the
concept of useful recommendations expressed by users with satisfactory
accuracy (higher than 70%) and precision (between 60% and 90%).
Also the evaluation suggests that both algorithms studied in the
paper perform similarly on real-world data if they are tuned properly.
|