Two types of single-peaked data: Correspondence analysis as an alternative to principal component analysis

https://doi.org/10.1016/j.csda.2008.09.010

Abstract

It has been argued that principal component analysis (PCA) is not appropriate for analyzing data conforming to single-peaked response models, also referred to as unfolding models. An overview of these findings is given, which relates them to the distinction between two types of unfolding models; namely, models that are either a quadratic function of the person-to-item distances or an exponential function of these distances. This distinction is easy to recognize empirically because the inter-item correlation matrix for the two types of data typically shows different patterns. Furthermore, for both types of unfolding models, correspondence analysis (CA), which is a rival method for dimensionality reduction, outperforms PCA in terms of representation of both person and item locations, especially for the exponential unfolding model. Finally, it is shown that undoubled CA outperforms doubled CA for both types of unfolding models. It is argued that performing CA on the raw data matrix is an unconventional, but meaningful approach to scaling items and persons on an underlying unfolding scale. A real data example on personality assessment is given, which shows that for this type of data (undoubled) CA is to be preferred over PCA.

Introduction

In this paper we explore the surplus value of correspondence analysis (CA) over principal component analysis (PCA) for analyzing one-dimensional, single-peaked responses, i.e. data conforming to a one-dimensional unfolding model. We will discuss continuous, binary and graded responses.

Single-peaked (unimodal) data naturally arise in a variety of research settings, e.g. marketing research (for example, DeSarbo et al. (2002)), ecological research (for example, De’ath (1999)), and archeology (for example, Kendall (1971)). In psychology single-peaked response curves can be found, for instance, in attitude measurement (Roberts et al., 2000): people with moderate tolerance towards abortion are less likely to agree with items that are either very much in favor of abortion or very much against it.

The essence of an unfolding model is that the probability of agreement with a certain item is inversely related to the distance between the position of the item on the latent continuum and the position of the respondent; the closer an item is located to the respondent’s position on the latent continuum, the more likely the respondent is to agree with it. In this case, the latent continuum is called bipolar: it ranges from a negative extreme (very much against abortion), via a neutral midpoint (neither against nor in favor of abortion), to a positive extreme (very much in favor of abortion). In the unfolding literature the positions of respondents or objects on this continuum are referred to as ideal points (Coombs, 1964).

There is a vast literature on the inappropriateness of PCA for analyzing data conforming to an unfolding model (Coombs and Kao, 1960; Ross and Cliff, 1964; Davison, 1977; Van Schuur and Kiers, 1994; Van Schuur and Kruijtbosch, 1995; Andrich, 1996; Rost and Luo, 1997; Andrich and Styles, 1998; Roberts et al., 1999; Roberts et al., 2000; Maraun and Rossi, 2001). The main conclusions from this literature are as follows. First, PCA of one-dimensional unfolding data results in a two-component solution, leading to erroneous conclusions about the dimensionality of the data. Second, the component scores of the persons with extreme positions on the latent scale underestimate the true positions. Third, component loadings of the items with extreme positions on the latent scale are underestimated, resulting in a non-optimal item selection. In the two-component PCA solution both persons and items either lie on a semi-circle (cf. Davison (1977)) or on a semi-circle with inwardly folded endpoints (cf. Roberts et al. (2000)). This inward bending of the endpoints is what is meant by underestimation of the true positions of the extreme persons and items. The current paper offers CA as an alternative to PCA and relates the different problems with PCA described above to the distinction between two types of unfolding models. To start with the latter: on the one hand, we have models that are a quadratic function of the person-to-item distances, and on the other hand, we have models that are an exponential function of these distances.

The often quoted paper by Davison (1977, see also Maraun and Rossi (2001)) discusses the quadratic unfolding model. For data conforming to this type of model, PCA suffers only from the aforementioned “extra-component” problem, but not from the problem of underestimation of the locations of extreme persons or items. That is, in the two-component solution, the person and item locations lie on a semi-circle. Furthermore, the inter-item correlation matrix of this type of data shows a “simplex-like” pattern, also referred to as a Robinson pattern (Hubert et al., 1998). That is, when the items are ordered according to their location on the latent scale, the correlations along the diagonal of the matrix will be highly positive; moving downward and to the left, the correlations will first decrease to zero and then decrease further to negative values in the lower left-hand corner.

The papers from the field of unfolding item response theory (IRT) (for example, Andrich (1996), Rost and Luo (1997), Andrich and Styles (1998) and Roberts et al. (2000)) discuss exponential unfolding models. For data conforming to this type of model, PCA suffers from both the “extra-component” problem and the problem of underestimation of the locations of extreme persons and items. That is, in the two-component solution, the person and item locations lie on a semi-circle with inwardly bending extremes, a pattern which can be described as a “horseshoe” pattern (cf. Greenacre (1984, pp. 226–232)). The problem of the inwardly bending extremes in the PCA solution has also been discussed in the field of ecology (Swan, 1970; Noy-Meir and Austin, 1970; Hill, 1973; De’ath, 1999).

In this paper CA is proposed as an alternative to PCA, since CA is known to represent single-peaked data correctly: Ter Braak (1985) showed that CA approximates the maximum likelihood solution of the Gaussian ordination model; Heiser (1981) showed that CA recovers the person and item order of error-free ratings conforming to an unfolding model. Section 1.1 explains the use of CA as an unfolding technique further.

It is known that when unfolding data are strongly one-dimensional, a two-dimensional CA representation will show what is often referred to as the “arch-effect”, where the items and persons are ordered along an arch (but also on the first dimension) according to their position on the scale (for example, Hill (1974) and Hill and Gauch (1980)). We prefer the term “arch” to “horseshoe”, to stress the importance of the outward bending of the extremes of the arch. In that case, the first dimension reflects the correct order of items and persons, as opposed to solutions with inwardly bending extremes (which is the usual shape of a horseshoe), where the order of the items and persons gets mixed up at the endpoints of the first dimension.

When rating scale data are analyzed with CA, the variables are usually “doubled” to create pairs of variables, which form the positive and the negative poles of the rating scale (see for example, Greenacre (1993, Chapter 19), and Greenacre (2007, Chapter 23)). We will explain this type of data coding in CA in Section 1.2. In this paper we show that when unfolding data are doubled, CA, like PCA, is hampered by the undesirable inward bending of the extremes.

In this paper CA with and without doubling is compared to PCA with and without varimax rotation. For this purpose, we simulated continuous, binary, and graded responses using three different unfolding models, which are described in Section 1.3. The first is the quadratic unfolding model as discussed by Davison (1977), which results in continuous responses. Of the exponential unfolding model two variants are compared: the Gaussian ordination model (Ihm and van Groenewoud, 1984), which results in binary responses and the generalized graded unfolding model (Roberts et al., 2000), which results in graded responses. Furthermore, an empirical data set concerning the measurement of personality development was analyzed.

In this section we explain to what types of data CA can be applied, and we explain the rationale behind using CA as an unfolding technique. CA is a multivariate technique primarily developed for the analysis of contingency table data (Benzécri, 1992; Greenacre, 1984). However, the technique can be applied to a broader range of data types, as long as the entries of the table contain measures of association strength between row entries and column entries. The association measure is assumed to be some non-negative quantity, where lack of association is indicated by a zero entry (Heiser, 2001).

In the current paper, we use CA as an unfolding technique. Typically an advantage of CA in this context is that it simultaneously scales both persons and items. Of the three most common normalizations in CA (i.e. row principal, column principal, and symmetrical normalization) we choose row principal normalization, so that a person is represented as the centroid (weighted average, with weights proportional to the ratings) of the items he has rated. This approach results in an interpretation of person scores as ideal points (Coombs, 1964). A higher rating of a given person on a given item results in a smaller person-to-item distance in the CA solution. Hence the expected responses are a single-peaked function of the person scores in the CA solution.

The distances between persons in a CA solution with row principal normalization approximate chi-square distances from below (Meulman, 1982, p. 33). The chi-square distance between two persons differs from the usual Euclidean distance in that, for each item, the squared difference between the persons’ scores is weighted by the inverse of the marginal proportion (i.e. the mass) of that item. As a consequence, persons and items with low mass tend to lie more in the periphery of the CA solution (see for example Ter Braak and Prentice (1988), reprinted in Ter Braak and Prentice (2004, pp. 262–263)). In the context of attitude items with ratings ranging from 0 (totally disagree) to 5 (totally agree), an item that few persons choose, i.e. an extreme item, will have a low mass. Analogously, a person who agrees with only one item is very likely to have an extreme opinion, and will have a low mass. We will show that for these extreme items and persons CA (without doubling) typically results in appropriate scale values.
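The centroid interpretation of the person scores and the relation to chi-square distances can be checked numerically. The following is a minimal sketch, assuming a textbook SVD formulation of CA; the function names `correspondence_analysis` and `chi_square_distance` and the small rating table are hypothetical and are not taken from the paper.

```python
import numpy as np

def correspondence_analysis(Z):
    """CA of a non-negative table Z via the SVD, row principal normalization."""
    Z = np.asarray(Z, dtype=float)
    P = Z / Z.sum()                        # correspondence matrix
    r = P.sum(axis=1)                      # row masses (persons)
    c = P.sum(axis=0)                      # column masses (items)
    # Standardized residuals from the independence model r c^T
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    F = (U * sv) / np.sqrt(r)[:, None]     # persons: principal coordinates
    G = Vt.T / np.sqrt(c)[:, None]         # items: standard coordinates
    return F, G, sv, P, r, c

def chi_square_distance(P, r, c, i, k):
    """Chi-square distance between the profiles of rows i and k."""
    diff = P[i] / r[i] - P[k] / r[k]
    return np.sqrt(np.sum(diff ** 2 / c))

# Hypothetical single-peaked ratings (0-5) of 5 persons on 4 items
Z = np.array([[5, 3, 1, 0],
              [4, 5, 2, 0],
              [2, 4, 4, 1],
              [0, 2, 5, 3],
              [0, 1, 3, 5]])

F, G, sv, P, r, c = correspondence_analysis(Z)

# Each person score on dimension 1 is the rating-weighted average of item scores
centroids = (P / r[:, None]) @ G[:, :1]

# Distances between persons 1 and 5: full space, first dimension, chi-square
d_full = np.linalg.norm(F[0] - F[4])
d_one_dim = abs(F[0, 0] - F[4, 0])
d_chi = chi_square_distance(P, r, c, 0, 4)
```

In this sketch the full-dimensional distance equals the chi-square distance, while the distance on the first dimension alone cannot exceed it, which is the sense in which a low-dimensional solution approximates chi-square distances from below.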

In this section we discuss two types of data coding in CA: undoubled and doubled data. These two approaches are also known as, respectively, asymmetric and symmetric treatment of response categories (see Gifi (1990, pp. 294–295)).

Asymmetric treatment of response categories implies performing CA on the raw data table, where, for the simple case of binary responses, disagreement is denoted with 0 and agreement with 1. In effect, only agreement implies similarity between respondents, and not disagreement.

Symmetric treatment of the response categories demands a type of recoding of the data commonly known as “doubling” (see, for example, Benzécri (1992) or Greenacre (1984)). This type of data coding complements a respondent’s original ratings with the reverse of these ratings, which are obtained by subtracting the ratings from the maximum rating. For example, for a person with the ratings 0, 2, and 4 on three items with a six-point scale ranging from 0 to 5, the complete set of doubled scores would be 0, 2, 4 along with 5, 3, 1. In effect, both shared agreement with a certain item and shared disagreement imply similarity between respondents. An argument for this procedure is that agreement with a statement is the same as disagreement with the opposite of this statement, so that the items need not all be worded in the same direction.
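The doubling operation described above can be sketched in a few lines. This is an illustrative example only; the helper name `double_ratings` is hypothetical, and the layout (original columns followed by reversed columns) is an assumption about how the doubled table is arranged.

```python
import numpy as np

def double_ratings(ratings, max_rating):
    """Append the reversed ratings (max_rating - rating) to the original ones."""
    ratings = np.asarray(ratings)
    return np.hstack([ratings, max_rating - ratings])

# One person with ratings 0, 2, 4 on a six-point scale running from 0 to 5
ratings = np.array([[0, 2, 4]])
doubled = double_ratings(ratings, max_rating=5)
row_total = doubled.sum(axis=1)
```

A side effect of this coding is that every person has the same row total (number of items times the maximum rating), so all persons receive equal mass in a CA of the doubled table.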

However, in Heiser (1981, Chapter 4) as well as in Benzécri (1992, p. 391, where we assume that in the final paragraph on p. 391 the word “not” is missing by mistake after the word “is” in the sentence “But the presence or absence in a plant of a quality such as being a perennial is of the same nature”) it is stressed that if response categories are thought to give an asymmetrical type of information, CA should be performed on the undoubled (raw) data. Even when the items are not all worded in the same direction, no reverse scoring is needed, as long as the disagree-category is coded with a zero score. In this case, the “attraction power” of items to persons, which is reflected in small person-to-item distances in the CA solution, is determined by high ratings. As a consequence, the proximity between persons in the CA solution depends on (the level of) shared agreement, and not on shared disagreement.

The argument for asymmetric treatment of the response categories is that a respondent can have only one reason for agreeing with a certain statement, but either one of two different reasons for disagreeing. That is, a respondent disagrees with the statement when he is either too “positive” to agree with the statement or too “negative”.

In the following we discuss the three different unfolding models that were used to generate single-peaked response data. We classify these models as either one of two different types of unfolding models, that is, quadratic or exponential. The first model is a quadratic function of the person-to-item distances, whereas the second and the third model are exponential functions of these distances.

To recognize single-peaked data empirically, Davison (1977) postulated predictions about the correlations and factor structure of responses $z_{ij}$ to various items where the responses fit a metric, unidimensional unfolding model. Two models were compared. Firstly, a model producing error-free data:

$$z_{ij} = a_j (x_i - y_j)^2 + b_j, \tag{1}$$

where

$z_{ij}$ is the response of person $i$ on item $j$;

$a_j$ is the discrimination parameter for item $j$;

$x_i$ is the ideal point for person $i$ on the underlying continuum;

$y_j$ is the location of item $j$ on the underlying continuum;

$b_j$ is the maximum of the curve for item $j$.

The discrimination parameter for a given item $j$, $a_j$, indicates the steepness of the response curve. In ecology, the inverse of the discrimination parameter is called the tolerance of species $j$, which is a measure of ecological amplitude. That is, the steeper the response curve, the smaller a species’ tolerance. Note that $a_j < 0$; otherwise the response curve would have a minimum instead of a maximum.

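Secondly, Davison discussed a model producing fallible data:

$$z_{ij} = a_j (x_i - y_j + E_{ij})^2 = (x_i - y_j)^2 + \sigma^2_{E_{ij}} + e_{ij}, \tag{2}$$

where

$E_{ij}$ is a random normal deviate;

$\sigma^2_{E_{ij}}$ is the variance of $E_{ij}$;

$e_{ij} = z_{ij} - (x_i - y_j)^2 - \sigma^2_{E_{ij}}$.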

Under the assumption of model (1) with $a_j = 1$, it follows from the results of Ross and Cliff (1964) that the matrix $Z$ with elements $z_{ij}$ has rank 3. One of the three components involves the quantities $y_j^2 + b_j$, which are constant across the rows of $Z$; another involves the quantities $x_i^2$, which are constant across the columns of $Z$; and the third involves the $x_i$ and $y_j$ themselves. Ross and Cliff also showed that centering the columns of $Z$ reduces its rank to two. Building on these results, Davison (1977) concluded that (a) the item-by-item correlation matrix displays a simplex-like pattern, (b) the signs of first-order partial correlations can be specified in an empirically testable manner, and (c) the items will have a semi-circular, two-factor structure. Along the semi-circle, variables will be ordered by their positions on the latent dimension. This ordering is affected by the amount of error included in the model; the most extreme items become mixed up with the next most extreme items. These conclusions were based on data sets with 100 persons and 10 items, where the items had fixed, equally spaced true scale values $y_j$ ranging from −3.00 to +3.00, and the 100 person scores $x_i$ were randomly sampled from a normal distribution, $N(0,1)$. It turned out that the correlations and factor structure were robust to non-normality of the person score distribution.
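The simplex-like correlation pattern and the two-component structure can be illustrated with simulated error-free data from the quadratic model. The sketch below is not Davison's procedure: it assumes a common discrimination of −1 (so that each curve has a maximum), a common peak height of 10, and 1000 simulated persons rather than 100 so that the sample correlations are stable; all of these values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 1000, 10          # assumption: larger sample than in the text
x = rng.standard_normal(n_persons)     # person ideal points, N(0, 1)
y = np.linspace(-3.0, 3.0, n_items)    # equally spaced item locations
a, b = -1.0, 10.0                      # hypothetical discrimination and peak height

# Error-free responses under the quadratic unfolding model
Z = a * (x[:, None] - y[None, :]) ** 2 + b

# Inter-item correlations, with items already ordered along the latent scale
R = np.corrcoef(Z, rowvar=False)
adjacent = np.diag(R, k=1)             # correlations between neighbouring items
extreme_pair = R[0, -1]                # correlation between the two extreme items

# Number of components with non-zero variance in a PCA of the correlations
eigenvalues = np.linalg.eigvalsh(R)[::-1]
n_components = int(np.sum(eigenvalues > 1e-8))
```

Neighbouring items correlate positively, the two extreme items correlate negatively, and only two eigenvalues of the correlation matrix are non-zero, although the data were generated from a single latent dimension.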

It should be noted that in CA not only the columns are centered, but the rows as well (double centering; Gifi (1990, Chapter 8)). For this case, Schoenemann (1970) showed that double centering of Z further reduces the rank to one, and that the x- and y-scores are recovered up to a scale factor. Therefore, when we generate data under the Davison model, we will obtain exactly one component with non-zero inertia in CA, due to the double centering. However, the joint scale of the scores depends on the chosen normalization, and may not be equal to the original one.
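The successive rank reductions can be verified on simulated error-free data. The following is a rough sketch under stated assumptions: a common discrimination of −1 (the rank results are the same for any common non-zero value), hypothetical peak heights drawn at random, and plain additive double centering, which illustrates the rank-one result but is not a full CA.

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_items = 100, 10
x = rng.standard_normal(n_persons)        # person ideal points
y = np.linspace(-3.0, 3.0, n_items)       # item locations
b = rng.uniform(5.0, 10.0, n_items)       # hypothetical peak heights b_j
a = -1.0                                  # assumed common discrimination

Z = a * (x[:, None] - y[None, :]) ** 2 + b[None, :]

# Centre the columns, then additionally centre the rows
col_centered = Z - Z.mean(axis=0, keepdims=True)
double_centered = col_centered - col_centered.mean(axis=1, keepdims=True)

rank_raw = np.linalg.matrix_rank(Z, tol=1e-8)
rank_col = np.linalg.matrix_rank(col_centered, tol=1e-8)
rank_double = np.linalg.matrix_rank(double_centered, tol=1e-8)

# The single remaining component reproduces the person scores up to scale and sign
U, sv, Vt = np.linalg.svd(double_centered, full_matrices=False)
corr_with_x = np.corrcoef(U[:, 0], x)[0, 1]
```

After double centering, the matrix reduces to a multiple of the outer product of the centered person scores and the centered item locations, which is why one component suffices and why the scores are recovered only up to a scale factor.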

Here it will be shown that CA approximates the Gaussian ordination model. This is a well-known model in the field of ecology for the single-peaked relationships between the abundance of a species and some environmental variable. However, it could also model the single-peaked relationships between the attitude of a person and some attitude item. Results follow from Ter Braak (1987) and Ihm and van Groenewoud (1984). We will start with the Gaussian ordination model as proposed by Ihm and van Groenewoud. This model is somewhat more general than the standard model since it has an extra parameter ($\alpha_i$) to account for different masses of the persons. The response $z_{ij}$ of person $i$ on item $j$ is approximated by a model using maximum likelihood given a binomial (or multinomial) distribution. The Gaussian ordination model is

$$\pi_{ij} = \alpha_i \beta_j \exp\left(-\tfrac{1}{2}(x_i - y_j)^2 / t_j^2\right), \tag{3}$$

where

$\pi_{ij}$ is the probability that person $i$ agrees with item $j$;

$x_i$ is the ideal point for person $i$ on the underlying continuum;

$y_j$ is the location of item $j$ on the underlying continuum;

$\beta_j$ is the maximum of the curve for item $j$;

$t_j^2$ is the discrimination parameter for item $j$.

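Assuming $t_j = t$ (equal discrimination parameters) we can rewrite (3) into

$$\pi_{ij} = \alpha_i^{*} \beta_j^{*} \exp(x_i y_j), \tag{4}$$

with $\alpha_i^{*} = \alpha_i / \exp\left(\frac{x_i^2}{2t^2}\right)$ and $\beta_j^{*} = \beta_j / \exp\left(\frac{y_j^2}{2t^2}\right)$ (the factor $1/t^2$ in the interaction term is absorbed into the scale of $x_i$ and $y_j$).

Using the Taylor expansion of first order we obtain

$$\pi_{ij} \approx \alpha_i^{*} \beta_j^{*} (1 + x_i y_j). \tag{5}$$

The least-squares estimate of $\alpha_i^{*} \beta_j^{*}$ is

$$\widehat{\alpha_i^{*} \beta_j^{*}} = \frac{z_{i+} z_{+j}}{z_{++}}.$$

Inserting this expression in (5) we obtain

$$\pi_{ij} \approx \frac{z_{i+} z_{+j}}{z_{++}} (1 + x_i y_j),$$

which is the CA model with one component. Note that the first-order Taylor expansion works well for small values of the interaction term $x_i y_j$. But the relation of CA with the Gaussian ordination model holds true as well for large values (Ter Braak, 1985; Ter Braak, 1987). See also Ter Braak (1988) and Zhu et al. (2005) for this link in constrained CA.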
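The step from the Gaussian ordination model to its bilinear form follows from expanding the square in the exponent. The worked expansion below is a sketch under the equal-discrimination assumption $t_j = t$; the starred symbols denote the rescaled person and item parameters, and the factor $1/t^2$ in the interaction term is written out explicitly here (it equals 1 when $x_i$ and $y_j$ are expressed in units of $t$).

```latex
\begin{align*}
\pi_{ij}
  &= \alpha_i \beta_j \exp\!\left(-\frac{(x_i - y_j)^2}{2t^2}\right) \\
  &= \alpha_i \exp\!\left(-\frac{x_i^2}{2t^2}\right)\,
     \beta_j \exp\!\left(-\frac{y_j^2}{2t^2}\right)\,
     \exp\!\left(\frac{x_i y_j}{t^2}\right) \\
  &= \alpha_i^{*}\,\beta_j^{*}\,\exp\!\left(\frac{x_i y_j}{t^2}\right)
   \approx \alpha_i^{*}\,\beta_j^{*}\left(1 + \frac{x_i y_j}{t^2}\right).
\end{align*}
```

The person-specific and item-specific factors play the role of the row and column margins, and the remaining interaction term is the single bilinear component that CA estimates.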

The generalized graded unfolding model (GGUM) is a parametric item response model that has been well developed and incorporates features such as variable item discrimination and variable threshold parameters for the response categories (Roberts et al., 2000). The GGUM allows for binary or graded responses, but will be used in the current paper to generate responses on a six-point rating scale. One premise of the GGUM is that for each person there are two subjective responses associated with each observable response, except for the totally agree response. These subjective responses can be seen as two distinct reasons for a person’s response. For instance, when a person strongly disagrees with a certain item, this could be for either of two reasons. If on the underlying continuum the item is located more to the right extreme than the person, the person disagrees from below the item. However, if the item is located more to the left extreme than the person, the person disagrees from above the item. The probability that a person will respond using a particular observable answer category is defined as the sum of the probabilities associated with the two corresponding subjective responses. Specifically, the model has the form:

$$P(Z_j = z \mid x_i) = \frac{\exp\left\{t_j\left[z(x_i - y_j) - \sum_{k=0}^{z}\tau_{jk}\right]\right\} + \exp\left\{t_j\left[(M - z)(x_i - y_j) - \sum_{k=0}^{z}\tau_{jk}\right]\right\}}{\sum_{\omega=0}^{C}\left(\exp\left\{t_j\left[\omega(x_i - y_j) - \sum_{k=0}^{\omega}\tau_{jk}\right]\right\} + \exp\left\{t_j\left[(M - \omega)(x_i - y_j) - \sum_{k=0}^{\omega}\tau_{jk}\right]\right\}\right)},$$

where

$Z_j$ is an observable response to attitude item $j$;

$z = 0, 1, 2, \ldots, C$ indexes the observable response categories, where $z = 0$ corresponds to the strongest level of disagreement;

$z = C$ corresponds to the strongest level of agreement;

$M$ is the number of subjective response categories minus 1;

$C$ is the number of response categories minus 1 ($M = 2C + 1$);

$x_i$ is the location of person $i$ on the attitude continuum;

$y_j$ is the location of item $j$ on the attitude continuum;

$t_j$ is the discrimination of attitude statement $j$; and

$\tau_{jk}$ is the location of the $k$th subjective response category threshold on the attitude continuum relative to the location of item $j$.
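The category probabilities of the GGUM can be computed directly from the definitions above. The following is a minimal sketch, not the simulation code used in the paper; the function names and the threshold values are hypothetical, and the first threshold $\tau_{j0}$ is assumed to be fixed at zero.

```python
import numpy as np

def ggum_probabilities(x, y, t, tau):
    """P(Z_j = z | x_i) for z = 0..C; tau = (tau_j0, ..., tau_jC), tau_j0 assumed 0."""
    tau = np.asarray(tau, dtype=float)
    C = len(tau) - 1                     # number of response categories minus 1
    M = 2 * C + 1                        # number of subjective categories minus 1
    z = np.arange(C + 1)
    cum_tau = np.cumsum(tau)             # sum of tau_jk for k = 0..z
    d = x - y                            # signed person-to-item distance
    # One term per subjective response: from below and from above the item
    numer = np.exp(t * (z * d - cum_tau)) + np.exp(t * ((M - z) * d - cum_tau))
    return numer / numer.sum()

def expected_rating(x, y, t, tau):
    """Expected observable response of a person at x on an item at y."""
    p = ggum_probabilities(x, y, t, tau)
    return float(np.dot(np.arange(len(p)), p))

# Hypothetical thresholds for a six-point scale (C = 5, M = 11)
tau = [0.0, -1.5, -1.2, -0.9, -0.6, -0.3]

p_below = ggum_probabilities(-2.0, 0.0, 1.0, tau)   # person below the item
p_above = ggum_probabilities(2.0, 0.0, 1.0, tau)    # person above the item
e_at_item = expected_rating(0.0, 0.0, 1.0, tau)
e_near = expected_rating(2.0, 0.0, 1.0, tau)
e_far = expected_rating(4.0, 0.0, 1.0, tau)
```

Persons at equal distances below and above the item obtain identical category probabilities, and with these thresholds the expected rating falls as the person moves away from the item, which is the single-peaked behaviour the model is meant to capture.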


Method

The aim of the present research is to compare the performance of CA (with and without doubled items) and PCA (with and without varimax rotation) in terms of the recovery of the “true” scale values. Three types of scale values were of interest: person scale values, item scale values, and scale values of persons and items taken together, referred to as the joint scale.

We chose to include CA with doubled items, with the aim of testing the presumption that, in the case of unfolding data, asymmetric

The three benchmark datasets

This section of results consists of two parts. First the matrices of inter-item correlations for the three benchmark datasets are compared. Second, the results of the two types of CA are compared to the results of the two types of PCA.

The inter-item correlations for the benchmark data conforming to model 1 are displayed in Table 1. The correlation matrix shows a strong Robinson pattern. The inter-item correlations for the benchmark datasets conforming to model 2 and 3 are similar with respect

Discussion

Across all analyses, CA without doubling performs best for unfolding data generated with three different single-peaked models. We have to make one reservation, however.

Both the analysis results for the three unfolding benchmark datasets and the results of the simulation study showed that in the case of the model 1 data CA recovered the joint scale poorly, whereas CA of the doubled data recovered the joint scale well. This is an exception in the current and existing results referred to in this

Acknowledgement

This research was conducted while Mark de Rooij was sponsored by the Netherlands Organisation for Scientific Research (NWO), Innovational Grant, no. 452-06-002.

References (42)

  • Heiser, W.J. Correspondence analysis.
  • Ter Braak, C.J.F., et al., 1988. A theory of gradient analysis. Advances in Ecological Research.
  • Ter Braak, C.J.F., et al., 2004. A theory of gradient analysis. Advances in Ecological Research.
  • Zhu, M., et al., 2005. Constrained ordination analysis with flexible response functions. Ecological Modelling.
  • Abraham, R.E., et al., 2001. The developmental profile. Journal of Personality Disorders.
  • Andrich, D., 1996. A hyperbolic cosine latent trait model for unfolding polytomous responses: Reconciling Thurstone and Likert methodologies. British Journal of Mathematical and Statistical Psychology.
  • Andrich, D., et al., 1998. The structural relationship between attitude and behavior statements from the unfolding perspective. Psychological Methods.
  • Benzécri, J.-P., 1992. Correspondence Analysis Handbook.
  • Coombs, C.H., 1964. A Theory of Data.
  • Coombs, C.H., et al., 1960. On a connection between factor analysis and multidimensional unfolding. Psychometrika.
  • Davison, M.L., 1977. On a metric unidimensional unfolding model for attitudinal and developmental data. Psychometrika.
  • De’ath, G., 1999. Principal curves: A technique for indirect and direct gradient analysis. Ecology.
  • DeSarbo, W.S., et al., 2002. A gravity based multidimensional scaling model for deriving spatial structures underlying consumer preference/choice judgements. Journal of Consumer Research.
  • Gifi, A., 1990. Nonlinear Multivariate Analysis.
  • Greenacre, M.J., 1984. Theory and Applications of Correspondence Analysis.
  • Greenacre, M.J., 1993. Correspondence Analysis in Practice.
  • Greenacre, M.J., 2006. Tying up the loose end in simple correspondence analysis. Working...
  • Greenacre, M.J., 2007. Correspondence Analysis in Practice.
  • Heiser, W.J., 1981. Unfolding analysis of proximity data. Ph.D. Thesis, Leiden, The Netherlands, Leiden...
  • Hill, M.O., 1973. Reciprocal averaging: An eigenvector method of ordination. Journal of Ecology.
  • Hill, M.O., 1974. Correspondence analysis: A neglected multivariate method. Applied Statistics.