Explaining individual response using aggregated data

https://doi.org/10.1016/j.jeconom.2008.05.008

Abstract

Empirical analysis of individual response behavior is sometimes limited due to the lack of explanatory variables at the individual level. In this paper we put forward a new approach to estimate the effects of covariates on individual response, where the covariates are unknown at the individual level but observed at some aggregated level. This situation may, for example, occur when the response variable is available at the household level but covariates only at the zip-code level.

We describe the missing individual covariates by a latent variable model which matches the sample information at the aggregate level. Parameter estimates can be obtained using maximum likelihood or a Bayesian analysis. We illustrate the approach by estimating the effects of household characteristics on donating behavior to a Dutch charity. Donating behavior is observed at the household level, while the covariates are only observed at the zip-code level.

Introduction

Empirical analysis of individual behavior is sometimes limited due to the lack of explanatory variables at the individual level. There may be various reasons why individual-level explanatory variables are not available. When using individual revealed preference data, information about explanatory variables may simply not be available as databases cannot be properly linked. For survey data, there may be a missing explanatory variable due to a missing question in the survey or a question which is interpreted the wrong way by the respondents.

In some cases it is possible to obtain information on explanatory variables at some aggregated level. For example, if the zip code of households is known, it may be possible to obtain aggregated information on household characteristics, like income and family size, at the zip-code level. This zip-code level information is usually obtained through surveys. The aggregated information of the variables is summarized in marginal probabilities which reflect the probability that the explanatory variable lies in some interval (income, age) or category (gender, religion) for a household in that zip-code region.

The goal of the current paper is to estimate the effects of covariates on individual response when the covariates are unobserved at the individual level but observed at some aggregated level. Several studies in economics try to link individual and aggregated data; see, for example, Imbens and Lancaster (1994) and van der Klaauw (2001). In contrast to our situation, these studies assume that both individual-level and aggregated data are available. The aggregated data are assumed to be more reliable and are used to put restrictions on the individual-level data. The situation of missing individual covariates is related to ecological inference; see Wakefield (2004) for an overview. The main difference with regular ecological inference problems is that we observe individual responses, whereas ecological inference also relies on aggregated information on the response variable. The extra information on individual responses may help us overcome certain identification issues in ecological inference.

As far as we know, the only paper that comes close to our situation is Steenburgh et al. (2003). The motivation for the use of aggregated data in that paper is, however, different from ours. The authors use zip-code information to describe unobserved heterogeneity in the individual behavior of households instead of estimating the effects of covariates on behavior. Our problem also bears similarities with symbolic data analysis; see Billard and Diday (2003) for an overview. Symbolic data analysis also deals with aggregated explanatory variables and dependent variables at an individual level. There, however, aggregation is used to summarize large datasets. Therefore, the form of the aggregated information is different and represents, for example, intervals instead of marginal probabilities.

In this paper we develop a new approach to estimate the effects of covariates on individual response when the covariates are unknown at the individual level but observed at some aggregated level in the form of marginal probabilities. We extend the standard individual response model with a latent variable model describing the missing explanatory variables. This latent variable model describes the missing explanatory variables in such a way that it matches the sample information at the aggregated level. In the case of one explanatory variable, the model simplifies to a standard mixture regression with known mixing proportions. A simple simulation experiment shows that this new approach is more efficient than the standard method, in which the missing explanatory variables are replaced by the observed marginal probabilities at the aggregated level.
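As a minimal sketch of the one-variable case, assume the simple model y_i = α + β x_i + ε_i with a latent 0/1 dummy x_i whose probability p_i = Pr(x_i = 1) is known from the aggregated data. The density of y_i is then a two-component normal mixture with known mixing proportions p_i and 1 − p_i. The Python code below is illustrative only: it uses the EM algorithm as a convenient stand-in for the estimation methods discussed in the paper, and all function names, parameter values, and the uniform draw of p_i are assumptions made for the example. The function ols_on_probabilities corresponds to the standard method of replacing x_i by p_i and also supplies starting values.

```python
import math
import random


def normal_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at y."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def ols_on_probabilities(y, p):
    """Standard method: regress y_i on the aggregated probability p_i."""
    n = len(y)
    p_bar = sum(p) / n
    y_bar = sum(y) / n
    sxy = sum((pi - p_bar) * (yi - y_bar) for pi, yi in zip(p, y))
    sxx = sum((pi - p_bar) ** 2 for pi in p)
    beta = sxy / sxx
    return y_bar - beta * p_bar, beta


def em_known_mixing(y, p, n_iter=150):
    """EM for y_i = alpha + beta*x_i + eps_i with latent x_i ~ Bernoulli(p_i).

    The mixing proportions p_i are treated as known (an assumption of this
    sketch), so only alpha, beta and sigma are updated.
    """
    n = len(y)
    alpha, beta = ols_on_probabilities(y, p)  # starting values
    y_bar = sum(y) / n
    sigma = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)
    for _ in range(n_iter):
        # E-step: posterior probability that x_i = 1 given y_i and p_i
        w = []
        for yi, pi in zip(y, p):
            f1 = pi * normal_pdf(yi, alpha + beta, sigma)
            f0 = (1.0 - pi) * normal_pdf(yi, alpha, sigma)
            w.append(f1 / (f1 + f0))
        # M-step: weighted component means and a common variance
        sw = sum(w)
        mu1 = sum(wi * yi for wi, yi in zip(w, y)) / sw
        mu0 = sum((1.0 - wi) * yi for wi, yi in zip(w, y)) / (n - sw)
        ss = sum(
            wi * (yi - mu1) ** 2 + (1.0 - wi) * (yi - mu0) ** 2
            for wi, yi in zip(w, y)
        )
        alpha, beta, sigma = mu0, mu1 - mu0, math.sqrt(ss / n)
    return alpha, beta, sigma


def simulate(n, alpha, beta, sigma, seed=0):
    """Hypothetical data: p_i uniform on (0.1, 0.9), x_i never returned."""
    rng = random.Random(seed)
    p = [rng.uniform(0.1, 0.9) for _ in range(n)]
    x = [1 if rng.random() < pi else 0 for pi in p]
    y = [alpha + beta * xi + rng.gauss(0.0, sigma) for xi in x]
    return y, p


if __name__ == "__main__":
    y, p = simulate(2000, alpha=1.0, beta=2.0, sigma=1.0, seed=1)
    print("OLS on p_i (alpha, beta):", ols_on_probabilities(y, p))
    print("EM mixture (alpha, beta, sigma):", em_known_mixing(y, p))
```

Both estimators only see y_i and p_i. Repeating the simulation over many seeds and comparing the spread of the two β estimates is one way to reproduce the kind of efficiency comparison described above.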

Parameter estimates of the individual response model can be obtained using Simulated Maximum Likelihood [SML] or a Bayesian approach. Given the computational burden of SML, the latter approach may be more convenient. To obtain posterior results, we use a Gibbs sampler with data augmentation. The unobserved explanatory variables are sampled alongside the model parameters. Conditional on the sampled explanatory variables, a standard Markov Chain Monte Carlo [MCMC] sampler can be used for the model describing individual response.
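The data-augmentation idea can be sketched for a linear model with a single latent 0/1 dummy, y_i = α + β x_i + ε_i, where p_i = Pr(x_i = 1) is known from the aggregated data. The Python code below alternates between drawing the parameters given the imputed x_i and drawing each x_i given the parameters. It is a sketch under assumptions not taken from the paper: a flat prior on (α, β), a prior proportional to 1/σ² on the variance, and hypothetical function names and simulation settings.

```python
import math
import random


def gibbs_latent_dummy(y, p, n_draws=800, burn_in=300, seed=0):
    """Gibbs sampler with data augmentation for a latent Bernoulli(p_i) dummy.

    Returns a list of (alpha, beta, sigma) draws after burn-in. Assumes both
    imputed groups stay non-empty, which holds for the sample sizes used here.
    """
    rng = random.Random(seed)
    n = len(y)
    y_sum = sum(y)
    y_bar = y_sum / n
    sigma2 = sum((yi - y_bar) ** 2 for yi in y) / n
    # Initialise the missing covariate from its known marginal probability
    x = [1 if rng.random() < pi else 0 for pi in p]
    draws = []
    for sweep in range(burn_in + n_draws):
        # Step 1: group means given x and sigma^2 (flat prior, so normal)
        n1 = sum(x)
        n0 = n - n1
        s1 = sum(yi for yi, xi in zip(y, x) if xi == 1)
        s0 = y_sum - s1
        mu1 = rng.gauss(s1 / n1, math.sqrt(sigma2 / n1))
        mu0 = rng.gauss(s0 / n0, math.sqrt(sigma2 / n0))
        # Step 2: sigma^2 given means and x (scaled inverse chi-square)
        ssr = sum((yi - (mu1 if xi == 1 else mu0)) ** 2 for yi, xi in zip(y, x))
        sigma2 = ssr / rng.gammavariate(n / 2.0, 2.0)
        # Step 3: each x_i given parameters, combining p_i with the likelihood
        for i in range(n):
            e1 = p[i] * math.exp(-0.5 * (y[i] - mu1) ** 2 / sigma2)
            e0 = (1.0 - p[i]) * math.exp(-0.5 * (y[i] - mu0) ** 2 / sigma2)
            x[i] = 1 if rng.random() < e1 / (e1 + e0) else 0
        if sweep >= burn_in:
            draws.append((mu0, mu1 - mu0, math.sqrt(sigma2)))
    return draws


def simulate(n, alpha, beta, sigma, seed=0):
    """Hypothetical data: p_i uniform on (0.1, 0.9), x_i never returned."""
    rng = random.Random(seed)
    p = [rng.uniform(0.1, 0.9) for _ in range(n)]
    x = [1 if rng.random() < pi else 0 for pi in p]
    y = [alpha + beta * xi + rng.gauss(0.0, sigma) for xi in x]
    return y, p


def posterior_means(draws):
    """Average each parameter over the retained draws."""
    k = len(draws)
    return tuple(sum(d[j] for d in draws) / k for j in range(3))


if __name__ == "__main__":
    y, p = simulate(1000, alpha=1.0, beta=2.0, sigma=1.0, seed=2)
    print("posterior means (alpha, beta, sigma):",
          posterior_means(gibbs_latent_dummy(y, p)))
```

Conditional on the imputed x_i, steps 1 and 2 are the usual draws for a linear regression, which is why a standard sampler for the response model can be reused once the missing covariates are sampled.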

The outline of the paper is as follows. In Section 2 we provide a simple introduction to the problem and perform a small simulation experiment to illustrate the merits of our approach. In Section 3 we extend the discussion to a more general setting. Parameter estimation is discussed in Section 4. In Section 5 we illustrate our approach by estimating the effects of household characteristics on donating behavior to a Dutch charity. We use aggregated information on household characteristics at the zip-code level to explain the individual response of households to a direct mailing by the charity. Finally, Section 6 concludes.

Preliminaries

To illustrate the benefits of our new approach, we start the discussion with a simple example. We consider a linear regression model with only one explanatory variable. The explanatory variable $x_i$ can only take the value 0 or 1, for example, a gender dummy. Let the observed response of individual $i$, $y_i$, be described by $y_i = \alpha + \beta x_i + \varepsilon_i$, where $\alpha$ is an intercept parameter and where $\beta$ describes the effect of the 0/1 dummy variable $x_i$ on $y_i$ for $i = 1, \ldots, N$. The error term $\varepsilon_i$ is assumed to be normally

Model specification

In this section, we generalize the discussion in the previous section in several ways. First, we relax the assumption that the model for $y_i$ is a linear regression model. Secondly, we allow for $m$ explanatory variables summarized in the $m$-dimensional vector $X_i$. Finally, we allow for other types of explanatory variables like ordered and unordered categorical variables. The vector of explanatory variables is written as $X_i = (X_i^{(1)}, X_i^{(2)}, X_i^{(3)})$, where $X_i^{(1)}$ contains the binary explanatory

Parameter estimation

To estimate the parameters of the model proposed in the previous section, we can use either maximum likelihood or a Bayesian approach. In this section we discuss both approaches and their relative merits.

We first derive the likelihood function. Let the density function of the data $y_i$ for the model in (7) conditional on the missing variables $X_i$ be given by $f(y_i \mid X_i; \beta, \theta)$, where $\beta$ and $\theta$ denote the model parameters. To derive the unconditional density of $y_i$ we have to sum over all possible

Application

To illustrate our approach, we consider in this section an application where we analyze the characteristics of households who donate to a large Dutch charity in the health sector. Households receive a direct mailing from the charity with a request to donate money. The household may not respond and donate nothing or respond and donate a positive amount. We have no information about the characteristics of the households apart from their zip code. At the zip-code level we know aggregated household

Conclusions

In this paper we have developed a new approach to estimate the effects of explanatory variables on individual response where the response variable is observed at the individual level but the explanatory variables are only observed at some aggregated level. This approach can, for example, be used if information about individual characteristics is only available at the zip-code level. To solve the limited data availability, we extend the model describing individual responses with a latent

Acknowledgements

We thank three anonymous reviewers, Dennis Fok, Philip Hans Franses, Rutger van Oest, Björn Vroomen and participants of seminars at the Institute of Advanced Studies in Vienna, the Université Catholique de Louvain in Louvain-la-Neuve, the NAKE research day in Amsterdam, Facultes Universitaires Notre-Dame de la Paix in Namur, and EEA/ESEM 2007 in Budapest for helpful comments. All estimation results are obtained using Ox version 4.00 (Doornik, 2002).

References (35)

  • A. Börsch-Supan et al., Smooth unbiased multivariate probability simulators for maximum likelihood estimation of limited dependent variable models, Journal of Econometrics (1993)
  • G.J. van den Berg et al., Combining micro and macro unemployment duration data, Journal of Econometrics (2001)
  • J. Aitchison et al., The generalization of probit analysis to the case of multiple responses, Biometrika (1957)
  • T. Amemiya, Bivariate probit analysis: Minimum chi-square methods, Journal of the American Statistical Association (1974)
  • J.R. Ashford et al., Multi-variate probit analysis, Biometrics (1970)
  • J. Barnard et al., Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage, Statistica Sinica (2000)
  • L. Billard et al., From the statistics of data to the statistics of knowledge: Symbolic data analysis, Journal of the American Statistical Association (2003)
  • S. Chib et al., Analysis of multivariate probit models, Biometrika (1998)
  • M.K. Cowles, Accelerating Monte Carlo Markov chain convergence for cumulative-link generalized linear models, Statistics and Computing (1996)
  • A.P. Dempster et al., Maximum likelihood estimation from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B (1977)
  • J.A. Doornik, Object-Oriented Matrix Programming using Ox (2002)
  • B.S. Everitt et al., Finite Mixture Distributions (1981)
  • S. Geman et al., Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence (1984)
  • J. Geweke, Efficient simulation from the multivariate normal and student-t distributions subject to linear constraints
  • J. Geweke, Getting it right: Joint distribution tests of posterior simulators, Journal of the American Statistical Association (2004)
  • J. Geweke, Contemporary Bayesian Econometrics and Statistics (2005)
  • J. Geweke et al., Alternative computational approaches to inference in the multinomial probit model, The Review of Economics and Statistics (1994)