R. Potharst (Rob)
http://repub.eur.nl/ppl/2906/
List of Publications
http://repub.eur.nl/
RePub, Erasmus University Repository
The AUK: A simple alternative to the AUC
http://repub.eur.nl/pub/76748/
Wed, 01 Aug 2012 00:00:01 GMT<div>U. Kaymak</div><div>A. Ben-David</div><div>R. Potharst</div>
The area under the receiver operating characteristic (ROC) curve, also known as the AUC-index, is commonly used for ranking the performance of data mining models. The AUC has various merits, such as ease of interpretation. However, since it is class indifferent, its usefulness while dealing with highly skewed data sets is questionable. In this paper, we propose a simple alternative scalar measure to the AUC-index, the area under the Kappa curve (AUK). The proposed AUK-index compensates for the class indifference of the AUC by being sensitive to the class distribution. Therefore, it is particularly suitable for measuring classifiers' performance on skewed data sets. After introducing the AUK we explore its mathematical relationship with the AUC and show that there is a non-linear relation between them.
Dilworth's Theorem Revisited, an Algorithmic Proof
http://repub.eur.nl/pub/23112/
Wed, 27 Apr 2011 00:00:01 GMT<div>W.H.L.M. Pijls</div><div>R. Potharst</div>
Dilworth's theorem establishes a link between a minimal path cover and a maximal antichain in a digraph.
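The link stated above can be checked on a small acyclic digraph. The following is a minimal sketch, not the algorithm of the paper: it uses the classical reduction of a minimum path cover to bipartite matching on the transitive closure, and finds a maximum antichain by brute force. The example graph and all function names are hypothetical.

```python
from itertools import combinations

def transitive_closure(n, edges):
    """reach[u][v] is True if a nonempty directed path leads from u to v."""
    reach = [[False] * n for _ in range(n)]
    for u, v in edges:
        reach[u][v] = True
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

def min_path_cover_size(n, reach):
    """Minimum number of paths covering all vertices: n minus a maximum matching."""
    match_right = [-1] * n  # match_right[v] = left vertex currently matched to v

    def try_augment(u, seen):
        # Standard augmenting-path search for bipartite matching.
        for v in range(n):
            if reach[u][v] and not seen[v]:
                seen[v] = True
                if match_right[v] == -1 or try_augment(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    matching = sum(try_augment(u, [False] * n) for u in range(n))
    return n - matching

def max_antichain_size(n, reach):
    """Brute force: largest vertex set with no path between any two members."""
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all(not reach[a][b] and not reach[b][a]
                   for a, b in combinations(subset, 2)):
                return size
    return 0

# Hypothetical example: arcs 0->2, 1->2, 2->3, 2->4
N = 5
EDGES = [(0, 2), (1, 2), (2, 3), (2, 4)]
REACH = transitive_closure(N, EDGES)
print(min_path_cover_size(N, REACH), max_antichain_size(N, REACH))
```

In this example the paths 0-2-3 and 1-2-4 cover all vertices, and {0, 1} is an antichain of the same size, as the theorem requires.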
A new proof for Dilworth's theorem is given. Moreover, an algorithm to find both the path cover and the antichain, as considered in the theorem, is presented.
AUK: a simple alternative to the AUC
http://repub.eur.nl/pub/19678/
Tue, 01 Jun 2010 00:00:01 GMT<div>U. Kaymak</div><div>A. Ben-David</div><div>R. Potharst</div>
The area under the Receiver Operating Characteristic (ROC) curve, also known as the AUC-index, is commonly used for ranking the performance of data mining models. The AUC has many merits, such as objectivity and ease of interpretation. However, since it is class indifferent, its usefulness while dealing with highly skewed data sets is questionable, to say the least. In this paper, we propose a simple alternative scalar measure to the AUC-index, the Area Under a Kappa curve (AUK). The proposed AUK-index compensates for the above basic flaw of the AUC by being sensitive to the class distribution. Therefore, it is particularly suitable for measuring classifiers' performance on skewed data sets. After introducing the AUK we explore its mathematical relationship with the AUC and show that there is a nonlinear relation between them.
Two algorithms for generating structured and unstructured monotone ordinal data sets
http://repub.eur.nl/pub/17537/
Mon, 01 Jun 2009 00:00:01 GMT<div>R. Potharst</div><div>M.C. van Wezel</div><div>A. Ben-David</div>
Monotone constraints are very common when dealing with multi-attribute ordinal problems. Hardness selection for grinding wheels, timely replacement of costly laser sensors in silicon wafer manufacturing, and the selection of the right personnel for sensitive production facilities are just a few examples of ordinal problems where monotonicity makes sense.
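To make the monotonicity constraint concrete, here is a minimal sketch of a monotonicity check together with one simple way of drawing a random monotone ordinal data set. It is an illustration under stated assumptions, not either of the algorithms of the paper; the function names, the parameters and the labelling rule are all hypothetical.

```python
import random

def dominates(a, b):
    """True if a is at least as large as b in every attribute."""
    return all(ai >= bi for ai, bi in zip(a, b))

def is_monotone(dataset):
    """Monotone: whenever x dominates z, the label of x is >= the label of z."""
    return all(
        yx >= yz
        for x, yx in dataset
        for z, yz in dataset
        if dominates(x, z)
    )

def random_monotone_dataset(n_points, n_attrs, n_values, n_classes, seed=0):
    """Draw distinct random attribute vectors and label them monotonically."""
    rng = random.Random(seed)
    points = {
        tuple(rng.randrange(n_values) for _ in range(n_attrs))
        for _ in range(n_points)
    }
    labelled = []
    # Visit points by increasing attribute sum: every point strictly dominated
    # by x has a smaller sum, so it has already received its label.
    for x in sorted(points, key=sum):
        lower = max((y for z, y in labelled if dominates(x, z)), default=0)
        labelled.append((x, rng.randint(lower, n_classes - 1)))
    return labelled

sample = random_monotone_dataset(n_points=50, n_attrs=3, n_values=5, n_classes=4)
print(is_monotone(sample))
```

A pair such as `((1, 1), 1)` and `((2, 2), 0)` violates the constraint, because the second example is larger in every attribute yet carries a lower label.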
In order to evaluate the performance of various ordinal classifiers one needs both artificially generated and real-world data sets. Two algorithms are presented for generating monotone ordinal data sets. The first can be used for generating random monotone ordinal data sets without an underlying structure. The second algorithm, which is the main contribution of this paper, describes for the first time how structured monotone data sets can be generated.
A support system for predicting eBay end prices
http://repub.eur.nl/pub/13581/
Sat, 01 Mar 2008 00:00:01 GMT<div>D.P. van Heijst</div><div>R. Potharst</div><div>M.C. van Wezel</div>
We create a support system for predicting end prices on eBay. The end price predictions are based on the item descriptions found in the item listings of eBay, and on some numerical item features. The system uses text mining and boosting algorithms from the field of machine learning. Our system substantially outperforms the naive method of predicting the category mean price. Moreover, interpretation of the model enables us to identify influential terms in the item descriptions and shows that the item description is more influential than the seller feedback rating, which was shown to be influential in earlier studies.
Improved customer choice predictions using ensemble methods
http://repub.eur.nl/pub/64407/
Thu, 16 Aug 2007 00:00:01 GMT<div>M.C. van Wezel</div><div>R. Potharst</div>
In this paper various ensemble learning methods from machine learning and statistics are considered and applied to the customer choice modeling problem. The application of ensemble learning usually improves the prediction quality of flexible models like decision trees and thus leads to improved predictions. We give experimental results for two real-life marketing datasets using decision trees, ensemble versions of decision trees and the logistic regression model, which is a standard approach for this problem. The ensemble models are found to improve upon individual decision trees and outperform logistic regression. Next, an additive decomposition of the prediction error of a model, known as the bias/variance decomposition, is considered. A model with a high bias lacks the flexibility to fit the data well. A high variance indicates that a model is unstable with respect to different datasets. Decision trees have a high variance component and a low bias component in the prediction error, whereas logistic regression has a high bias component and a low variance component. It is shown that ensemble methods aim at minimizing the variance component in the prediction error while leaving the bias component unaltered. Bias/variance decompositions for all models for both customer choice datasets are given to illustrate these concepts.
A decision support system for direct mailing decisions
http://repub.eur.nl/pub/73508/
Wed, 01 Nov 2006 00:00:01 GMT<div>J-J. Jonker</div><div>N. Piersma</div><div>R. Potharst</div>
Direct marketing firms want to transfer their message as efficiently as possible in order to obtain a profitable long-term relationship with individual customers. Much attention has been paid to address selection of existing customers and to identifying new profitable prospects. Less attention has been paid to the optimal frequency of the contacts with customers. We provide a decision support system that helps the direct mailer to determine mailing frequency for active customers. The system observes the mailing pattern of these customers in terms of the well-known R(ecency), F(requency) and M(onetary) variables. The underlying model is based on an optimization model for the frequency of direct mailings. The system provides the direct mailer with tools to define preferred response behavior and advises the direct mailer on the mailing strategy that will steer the customers towards this preferred response behavior.
A support system for predicting eBay end prices.
http://repub.eur.nl/pub/8189/
Wed, 25 Oct 2006 00:00:01 GMT<div>D.P. van Heijst</div><div>R. Potharst</div><div>M.C. van Wezel</div>
In this report a support system for predicting end prices on eBay is
proposed. The end price predictions are based on the item descriptions found in
the item listings of eBay, and on some numerical item features.
The system uses text mining and boosting algorithms from the
field of machine learning.
Our system substantially outperforms the naive method of
predicting the category mean price. Moreover, interpretation of
the model enables us to identify influential terms in the item
descriptions and shows that the item description is more
influential than the seller feedback rating, which was shown to be
influential in earlier studies.
Boosting the accuracy of hedonic pricing models
http://repub.eur.nl/pub/7145/
Fri, 02 Dec 2005 00:00:01 GMT<div>M.C. van Wezel</div><div>M. Kagie</div><div>R. Potharst</div>
Hedonic pricing models attempt to model a relationship between object attributes and
the object's price. Traditional hedonic pricing models are often parametric models that suffer
from misspecification. In this paper we create these models by means of boosted CART
models. The method is explained in detail and applied to various datasets. Empirically,
we find substantial reduction of errors on out-of-sample data for two out of three datasets
compared with a stepwise linear regression model. We interpret the boosted models by partial
dependence plots and relative importance plots. This reveals some interesting nonlinearities
and differences in attribute importance across the model types.
Generating artificial data with monotonicity constraints
http://repub.eur.nl/pub/1916/
Fri, 11 Mar 2005 00:00:01 GMT<div>R. Potharst</div><div>M.C. van Wezel</div>
The monotonicity constraint is a common side condition imposed on
modeling problems as diverse as hedonic pricing, personnel
selection and credit rating. Experience tells us that it is not
trivial to generate artificial data for supervised learning
problems when the monotonicity constraint holds. Two algorithms
are presented in this paper for such learning problems. The first
one can be used to generate random monotone data sets without an
underlying model, and the second can be used to generate monotone
decision tree models. If needed, noise can be added to the
generated data. The second algorithm makes use of the first one.
Both algorithms are illustrated with an example.
Modeling brand choice using boosted and stacked neural networks
http://repub.eur.nl/pub/1911/
Thu, 10 Mar 2005 00:00:01 GMT<div>R. Potharst</div><div>M. van Rijthoven</div><div>M.C. van Wezel</div>
The brand choice problem in marketing has recently been addressed with methods
from computational intelligence such as neural networks. Another class of methods
from computational intelligence, the so-called ensemble methods such as boosting
and stacking have never been applied to the brand choice problem, as far as we know.
Ensemble methods generate a number of models for the same problem using any base
method and combine the outcomes of these different models. It is well known that
in many cases the predictive performance of ensemble methods significantly exceeds
the predictive performance of their base methods. In this report we use boosting
and stacking of neural networks and apply this to a scanner dataset that is a benchmark
dataset in the marketing literature. Using these methods, we find a significant
improvement in predictive performance on this dataset.
Improved customer choice predictions using ensemble methods
http://repub.eur.nl/pub/1943/
Tue, 08 Mar 2005 00:00:01 GMT<div>M.C. van Wezel</div><div>R. Potharst</div>
In this paper various ensemble learning methods from machine
learning and statistics are considered and applied to the customer
choice modeling problem. The application of ensemble learning
usually improves the prediction quality of flexible models like
decision trees and thus leads to improved predictions. We give
experimental results for two real-life marketing datasets using
decision trees, ensemble versions of decision trees and the
logistic regression model, which is a standard approach for this
problem. The ensemble models are found to improve upon individual
decision trees and outperform logistic regression.
Next, an additive decomposition of the prediction error of a
model, the bias/variance decomposition, is considered. A model
with a high bias lacks the flexibility to fit the data well. A
high variance indicates that a model is unstable with respect to
different datasets. Decision trees have a high variance component
and a low bias component in the prediction error, whereas logistic
regression has a high bias component and a low variance component.
It is shown that ensemble methods aim at minimizing the variance
component in the prediction error while leaving the bias component
unaltered. Bias/variance decompositions for all models for both
customer choice datasets are given to illustrate these concepts.
Direct Mailing Decisions for a Dutch Fundraiser
http://repub.eur.nl/pub/260/
Mon, 02 Dec 2002 00:00:01 GMT<div>J-J. Jonker</div><div>N. Piersma</div><div>R. Potharst</div>
Direct marketing firms want to transfer their message as efficiently
as possible in order to obtain a profitable long-term relationship
with individual customers. Much attention has been paid to address
selection of existing customers and to identifying new profitable
prospects. Less attention has been paid to the optimal frequency of
the contacts with customers. We provide a decision support system that
helps the direct mailer to determine mailing frequency for active
customers. The system observes the mailing pattern of these customers
in terms of the well-known R(ecency), F(requency) and M(onetary)
variables. The underlying model is based on an optimization model for
the frequency of direct mailings. The system provides the direct
mailer with tools to define preferred response behavior and advises
the direct mailer on the mailing strategy that will steer the
customers towards this preferred response behavior.
Classification Trees for Problems with Monotonicity Constraints
http://repub.eur.nl/pub/195/
Tue, 23 Apr 2002 00:00:01 GMT<div>R. Potharst</div><div>A.J. Feelders</div>
For classification problems with ordinal attributes very often the
class attribute should increase with each or some of the
explanatory attributes. These are called classification problems
with monotonicity constraints. Classical decision tree algorithms
such as CART or C4.5 generally do not produce monotone trees, even
if the dataset is completely monotone. This paper surveys the
methods that have so far been proposed for generating decision
trees that satisfy monotonicity constraints. A distinction is made
between methods that work only for monotone datasets and methods
that work for monotone and non-monotone datasets alike.
Pattern-Based Target Selection Applied to Fund Raising
http://repub.eur.nl/pub/117/
Thu, 18 Oct 2001 00:00:01 GMT<div>W.H.L.M. Pijls</div><div>R. Potharst</div><div>U. Kaymak</div>
This paper proposes a new algorithm for target selection. This
algorithm collects all frequent patterns (equivalent to frequent item
sets) in a training set. These patterns are stored efficiently using a
compact data structure called a trie. For each pattern the relative
frequency of the target class is determined. Target selection is achieved
by matching the candidate records with the patterns in the trie. A
score for each record results from this matching process, based upon
the frequency values in the trie. The records with the best score values
are selected. We have applied the new algorithm to a large data set
containing the results of a number of mailing campaigns by a Dutch
charity organization. Our algorithm turns out to be competitive with
logistic regression and superior to CHAID.
Neural Networks for Target Selection in Direct Marketing
http://repub.eur.nl/pub/83/
Thu, 29 Mar 2001 00:00:01 GMT<div>R. Potharst</div><div>U. Kaymak</div><div>W.H.L.M. Pijls</div>
Partly due to a growing interest in direct marketing, it has become an important application field for data mining. Many techniques have been applied to select the targets in commercial applications, such as statistical regression, regression trees, neural computing, fuzzy clustering and association rules. Modeling of charity donations has also recently been considered. The availability of a large number of techniques for analyzing the data may look overwhelming and ultimately unnecessary at first. However, the amount of data used in direct marketing is tremendous. Further, there are different types of data and likely strong nonlinear relations amongst different groups within the data. Therefore, it is unlikely that there will be a single method that can be used under all circumstances. For that reason, it is important to have access to a range of different target selection methods that can be used in a complementary fashion. In this respect, learning systems such as neural networks have the advantage that they can adapt to the nonlinearity in the data to capture the complex relations. This is an important motivation for applying neural networks for target selection. In this report, neural networks are applied to target selection in modeling of charity donations. Various stages of model building are described by using data from a large Dutch charity organization as a case. The results are compared with the results of more traditional methods for target selection such as logistic regression and CHAID.
Classification and Target Group Selection Based Upon Frequent Patterns
http://repub.eur.nl/pub/50/
Fri, 20 Oct 2000 00:00:01 GMT<div>W.H.L.M. Pijls</div><div>R. Potharst</div>
In this technical report, two new algorithms based upon frequent patterns are proposed. One algorithm is a classification method. The other one is an algorithm for target group selection. In both algorithms, first of all, the collection of frequent patterns in the training set is constructed. Choosing an appropriate data structure allows us to keep the full collection of frequent patterns in memory. The classification method uses this collection directly. Target group selection is a known problem in direct marketing. Our selection algorithm is based upon the collection of frequent patterns.
Quasi-monotone decision trees for ordinal classification
http://repub.eur.nl/pub/446/
Thu, 01 Jan 1998 00:00:01 GMT<div>R. Potharst</div><div>J.C. Bioch</div><div>R. van Dordregt</div>
In many classification problems the domains of the attributes and the classes are linearly ordered. Since the known decision tree methods generate non-monotone trees, these methods are not suitable for monotone classification problems. We already provided order-preserving tree-generation algorithms for multi-attribute classification problems with k linearly ordered classes in a previous paper. For real-world datasets it is important to consider approximate solutions to handle problems like speed, tree-size and noise. In this report we develop a new decision tree algorithm that generates quasi-monotone decision trees. This algorithm outperforms classical algorithms such as those of Quinlan with respect to prediction, and beats algorithms that generate strictly monotone decision trees with respect to speed. This report contains proofs of all presented results.
Monotone Decision Trees
http://repub.eur.nl/pub/522/
Wed, 01 Jan 1997 00:00:01 GMT<div>J.C. Bioch</div><div>T. Petter</div><div>R. Potharst</div>
EUR-FEW-CS-97-07. In many classification problems the domains of the attributes and the classes are linearly ordered. Often, classification must preserve this ordering: this is called monotone classification. Since the known decision tree methods generate non-monotone trees, these methods are not suitable for monotone classification problems. In this report we provide a number of order-preserving tree-generation algorithms for multi-attribute classification problems with k linearly ordered classes.
Bivariate decision trees
http://repub.eur.nl/pub/458/
Mon, 01 Jan 1996 00:00:01 GMT<div>J.C. Bioch</div><div>O. van der Meer</div><div>R. Potharst</div>
Decision trees with tests based on a single variable, as produced by methods such as ID3 and C4.5, often require a large number of tests to achieve an acceptable accuracy. This makes interpretation of these trees, which is an important reason for their use, disputable. Recently, a number of methods for constructing decision trees with multivariate tests have been presented. Multivariate decision trees are often smaller and more accurate than univariate trees; however, the use of linear combinations of the variables may result in trees that are hard to interpret. In this paper we consider trees with tests based on combinations of at most two variables. We show that bivariate decision trees are an interesting alternative to both uni- and multivariate trees.
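The difference between a univariate and a bivariate test can be illustrated with a small sketch. The abstract does not specify the form of the two-variable tests, so a linear test `a*x[i] + b*x[j] <= c` is assumed here; the synthetic data, the threshold grid and all names are hypothetical.

```python
import random

def univariate_test(x, i, threshold):
    """Test on a single variable, as in ID3- or C4.5-style trees."""
    return x[i] <= threshold

def bivariate_test(x, i, j, a, b, c):
    """Assumed form of a two-variable test: a linear combination of x[i] and x[j]."""
    return a * x[i] + b * x[j] <= c

def accuracy(predict, data):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(predict(x) == y for x, y in data) / len(data)

# Synthetic data: points in the unit square, labelled by the diagonal x0 + x1 <= 1.
random.seed(0)
points = [(random.random(), random.random()) for _ in range(500)]
data = [(x, x[0] + x[1] <= 1.0) for x in points]

# Best single test on one variable, searched over a coarse grid of thresholds.
best_stump = max(
    accuracy(lambda x, i=i, t=t: univariate_test(x, i, t), data)
    for i in (0, 1)
    for t in [k / 20 for k in range(21)]
)

# One bivariate test that follows the diagonal boundary.
bivariate_acc = accuracy(lambda x: bivariate_test(x, 0, 1, 1.0, 1.0, 1.0), data)
print(best_stump, bivariate_acc)
```

On this data a single bivariate test reproduces the labels, whereas a univariate tree would need a staircase of axis-parallel tests to approximate the same boundary.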