Ordered logit analysis for selectively sampled data

https://doi.org/10.1016/S0167-9473(02)00063-4

Abstract

If individuals are classified into ordered categories that are defined from the outset, it may happen that the majority belong to a single category. If a market researcher is interested in the correlation between the classification and individual characteristics, a natural question is whether one needs to collect data for all individuals in that particular category. This question is dealt with in the context of the ordered logit model. It is shown that there is no need to consider all those individuals and that only a fraction of them is needed. All that is required is a simple modification of the log likelihood, which is based on Bayes’ theorem. The proposed method is illustrated using simulated data and data concerning risk profiles of customers of an investment bank.

Introduction

For marketing purposes it is often of interest to classify individuals into various market segments. For example, an investment firm, with access to a large database with information on characteristics of its customers and their past investment behavior, may want to classify its customers according to their risk profile. Usually, these risk profiles are defined from the outset as discrete categories with a certain ordering. More formally, they form a finite ordered partition into exhaustive and non-overlapping subsets. Assuming that there are J such categories, category 1 can contain the individuals with the least risky asset portfolios, while categories 2 to J contain investors holding increasingly risky portfolios. An investment firm may now be interested in examining possible links between the characteristics of a customer and his or her risk profile. Information on the relation between a customer's characteristics and risk profile can, in general, be used to gain insight into the behavior of customers. For example, this information can be used to select individuals who are expected to be interested in a new investment product. As the response scale is ordinal, one usually has to rely on an ordered regression model.

The database of an investment firm can be very large as it contains a host of information on all its customers. On the other hand, it may well be that only a few individuals fall into one specific category. For example, suppose that there are N individuals and J=3 categories with N1, N2 and N3 customers, respectively. Customers in categories 1 and 3 may now be substantially outnumbered by the individuals in category 2, who hold portfolios with an average risk. A natural question is then whether one needs to include all N2 individuals in the analysis of the risk categories. Indeed, if only a fraction of N2 suffices, one would save much time and effort, as not all data have to be collected, checked for errors, and stored. Note that the selection is made conditional on the classification.
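Selection conditional on the classification can be sketched as follows. This is only an illustration under stated assumptions: the function name `subsample_by_class`, the class shares, and the retained fraction of 0.1 are hypothetical and do not come from the paper.

```python
import numpy as np

def subsample_by_class(y, big_class, fraction, rng):
    """Return indices of a selectively sampled data set.

    Every observation outside `big_class` is kept; each observation in
    `big_class` is kept independently with probability `fraction`.
    """
    y = np.asarray(y)
    keep = np.ones(y.shape[0], dtype=bool)
    in_big = y == big_class
    # Thin only the dominant category; selection depends on the class alone.
    keep[in_big] = rng.random(int(in_big.sum())) < fraction
    return np.flatnonzero(keep)

rng = np.random.default_rng(0)
# Hypothetical labels for J = 3 categories in which category 2 dominates.
y = rng.choice([1, 2, 3], size=10_000, p=[0.05, 0.90, 0.05])
idx = subsample_by_class(y, big_class=2, fraction=0.1, rng=rng)

counts_full = np.bincount(y, minlength=4)[1:]      # N1, N2, N3
counts_sub = np.bincount(y[idx], minlength=4)[1:]  # counts after thinning
print("full sample:", counts_full)
print("subsample:  ", counts_sub)
```

Only the category labels are needed to draw the subsample, so the individual characteristics have to be collected only for the retained individuals.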

The main focus of this paper is on developing an adjusted estimation technique that allows for correct inference for selectively sampled data. In this paper we choose to use an ordered logit model as the point of departure. Using artificial data as well as data concerning the above-mentioned risk-profile example, we show that there is indeed no need to collect information on all N2 individuals, and that any fraction will do. A modification of the likelihood function gives comparable inference in both cases: both sets of estimates refer to the same parameters, and in practice they do not differ substantially.
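The idea behind such a modification can be sketched with Bayes' theorem. The notation here is an assumption made for illustration and is not taken from the paper: $y_{i,j} = 1$ means that individual i belongs to category j, $\pi_j(x_i)$ is the ordered logit probability of that event given the characteristics $x_i$, $s_i = 1$ indicates that individual i is retained in the sample, and $\gamma_j$ is the known probability of retaining an individual from category j (for instance $\gamma_2 < 1$ for the large category and $\gamma_1 = \gamma_3 = 1$). If selection depends only on the category, then

```latex
\begin{align*}
\Pr(y_{i,j} = 1 \mid x_i, s_i = 1)
  &= \frac{\Pr(s_i = 1 \mid y_{i,j} = 1)\,\Pr(y_{i,j} = 1 \mid x_i)}
          {\sum_{k=1}^{J} \Pr(s_i = 1 \mid y_{i,k} = 1)\,\Pr(y_{i,k} = 1 \mid x_i)}
   = \frac{\gamma_j\,\pi_j(x_i)}{\sum_{k=1}^{J} \gamma_k\,\pi_k(x_i)}, \\
\ell
  &= \sum_{i:\, s_i = 1} \sum_{j=1}^{J} y_{i,j}
     \log \frac{\gamma_j\,\pi_j(x_i)}{\sum_{k=1}^{J} \gamma_k\,\pi_k(x_i)}.
\end{align*}
```

When all $\gamma_j$ are equal, the correction cancels because the $\pi_k(x_i)$ sum to one, and the usual ordered logit log likelihood is recovered.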

The outline of this paper is as follows. In Section 2, we briefly discuss some essentials of the ordered logit model. In Section 3, we put forward the modification of the log likelihood, which allows for selective sampling from a large number of individuals who all would be classified into the same category. In Section 4, we evaluate our modification concerning selective sampling in a limited simulation experiment. In Section 5, we apply our method to risk profile data from a large Dutch investment bank. In Section 6, we conclude with some remarks.

Section snippets

Representation

Consider the following discrete dependent variable

$$y_{i,j} = \begin{cases} 1 & \text{if observation } i \text{ belongs to category } j, \\ 0 & \text{otherwise,} \end{cases} \qquad i=1,\dots,N, \quad j=1,\dots,J,$$

where j can be thought of as a risk profile class, such that j=1 corresponds with customers holding low-risk asset portfolios. Assume that there is a latent variable $y_i^*$, which can be modeled as

$$y_i^* = \beta^T x_i + \varepsilon_i, \qquad \varepsilon_i \sim \text{Logistic},$$

that is, $y_i^*$ can be explained by explanatory variables contained in the K×1 vector $x_i$. The effect of $x_i$ on $y_i^*$ is measured by the K×1 vector β. The unexplained part
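The category probabilities implied by a latent-variable model of this kind can be sketched as follows. The excerpt does not show how the latent variable is mapped to categories, so this sketch assumes the standard ordered logit convention with increasing thresholds `alpha` of length J-1; the function names and the example values are hypothetical.

```python
import numpy as np

def logistic_cdf(z):
    """Standard logistic distribution function."""
    return 1.0 / (1.0 + np.exp(-z))

def ordered_logit_probs(X, beta, alpha):
    """Return an (N, J) matrix with entries P(y_i = j | x_i).

    X     : (N, K) matrix of explanatory variables
    beta  : (K,) coefficient vector
    alpha : (J-1,) increasing thresholds (assumed, not from the excerpt)
    """
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    index = X @ beta                                       # x_i' beta
    # Cumulative probabilities P(y_i <= j) = F(alpha_j - x_i' beta).
    cum = logistic_cdf(alpha[None, :] - index[:, None])    # (N, J-1)
    n = index.shape[0]
    cum = np.hstack([np.zeros((n, 1)), cum, np.ones((n, 1))])
    # Differences of adjacent cumulative probabilities give the classes.
    return np.diff(cum, axis=1)

# Two hypothetical individuals, K = 1 and J = 3.
probs = ordered_logit_probs(X=[[0.0], [1.0]], beta=[1.0], alpha=[-1.0, 1.0])
print(probs)
```

Each row sums to one, and a larger value of the index shifts probability mass toward the higher categories.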

Selective sampling and the ordered logit model

In this section we examine how we have to modify the log likelihood to obtain proper estimates in case we use selectively sampled data. Consider a situation where, in the population, one class of individuals outnumbers all other classes. Furthermore, consider a researcher interested in estimating a model for a (ordered) classification of individuals. In many cases it is easy to observe the classification of a specific individual. However, in general it is more difficult and time consuming to

Simulation results

To evaluate the practical usefulness of our method, which involves a correction of the likelihood function, and to check the accuracy of the estimated asymptotic variance calculated from a subset of the sample, we consider a simulation experiment. This experiment is not directed at presenting properties of the maximum likelihood estimator for the ordered logit model. We conduct a limited simulation study to show that the properties of the estimator are not distorted by the data reduction. To
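A check in this spirit can be sketched in a few lines. This is not the paper's experiment: the single-covariate design, the parameter values, the thresholds `alpha`, and the retention probabilities `gamma` are all hypothetical, and the correction is assumed to take the Bayes form gamma_j * pi_j / sum_k gamma_k * pi_k. The sketch thins the middle category and compares the class shares in the reduced sample with the averages of the unadjusted and adjusted probabilities.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 200_000
beta = 1.0                          # hypothetical slope
alpha = np.array([-2.5, 2.5])       # hypothetical thresholds, J = 3
gamma = np.array([1.0, 0.2, 1.0])   # hypothetical retention probabilities

# Latent variable with standard logistic errors; classes coded 0, 1, 2.
x = rng.normal(size=N)
y_star = beta * x + rng.logistic(size=N)
y = np.searchsorted(alpha, y_star)

# Selective sampling: the retention probability depends on the class only.
keep = rng.random(N) < gamma[y]
xs, ys = x[keep], y[keep]

def logistic_cdf(z):
    return 1.0 / (1.0 + np.exp(-z))

# Unadjusted ordered logit probabilities at the true parameter values.
cum = np.column_stack([
    np.zeros_like(xs),
    logistic_cdf(alpha[0] - beta * xs),
    logistic_cdf(alpha[1] - beta * xs),
    np.ones_like(xs),
])
pi = np.diff(cum, axis=1)

# Bayes-adjusted probabilities, conditional on being in the reduced sample.
adj = gamma * pi / (gamma * pi).sum(axis=1, keepdims=True)

empirical = np.bincount(ys, minlength=3) / ys.size
model_raw = pi.mean(axis=0)
model_adj = adj.mean(axis=0)
print("class shares in reduced sample:", empirical.round(3))
print("mean unadjusted probabilities: ", model_raw.round(3))
print("mean adjusted probabilities:   ", model_adj.round(3))
```

Under these assumptions the adjusted probabilities should track the class shares of the reduced sample closely, while the unadjusted ones overstate the share of the thinned category.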

An application to risk profiles

In this section we illustrate the method of selective sampling using the adjusted log likelihood for real-life data. Our potential data set consists of 41,582 customers of an investment bank, who are classified by this bank as having a low-, middle- or high-risk profile, where these profiles are indicated by the type of asset portfolio an individual currently holds. For example, a customer having options will be assigned to a high-risk profile. Again, we have no specific information as to which

Concluding remarks

We proposed a simple modification to the log likelihood of an ordered logit model, which enables a researcher to (randomly) select a subset of data from a category containing substantially more observations than other categories have. Through Monte Carlo simulations and an analysis of real-life data, we showed that our method results in unbiased estimates and that the estimated standard deviations do not increase to a large extent. Our method is useful for practical purposes as one may save on

Acknowledgements

We thank the co-editor and two reviewers of CSDA for useful comments and Richard Paap for helpful discussions. We are specifically thankful to J.S. Cramer for bringing the topic to our attention. Furthermore, we thank Rabobank Nederland for providing us with the data.


A computer program, which was used for all calculations in this paper, can be obtained from Fok.
