Subject: what is factor vs. component analysis, and what good is either

Siva Ganesh says:

>Date: Fri, 7 Aug 1998 08:05:22 GMT+12
>From: Siva Ganesh
>Subject: Re: Crescent of points in factor analysis.
>
>I don't have a direct answer to your particular question, perhaps I
>will think about it in the next couple of days. But, my question is,
>'did you really do a "FACTOR ANALYSIS" or was it simply a "PRINCIPAL
>COMPONENT ANALYSIS"? I have this notion that factor analysis is not
>really very useful in pure sciences (such as your study).
>I hate the fact that many computer software packages confuse between
>factor analysis and principal component analysis (eg. SPSS, SAS in
>its proc factor (the default is PCA), ...). A simple diagram I use
>for explaining the difference is,
>
> FACTORS ---> DATA (VARIABLES...) ---> PRINCIPAL COMPONENTS
>
>Ganesh.
>---------------------------------------------------

The above posting raises at least two issues. One is:

>'did you really do a "FACTOR ANALYSIS" or was it simply a "PRINCIPAL
>COMPONENT ANALYSIS"? ...
>I hate the fact that many computer software packages confuse between
>factor analysis and principal component analysis (eg. SPSS, SAS

and the second is:

>I have this notion that factor analysis is not
>really very useful in pure sciences (such as your study).

Dr. Ganesh is not alone in having such views, and I think this is one reason why they are worth careful consideration and a careful reply. But I will argue below that these views arise from an incomplete or mistaken picture of factor analysis. Let me explain.

(Re: Dr. Ganesh's issue 1)

Unfortunately, there is inconsistency across scientific fields (and even to some extent _within_ some fields) on the proper definitions of PCA and FA. One reason is that there are really three rather than two things to be distinguished:

(a) a strict Principal Components Analysis in the mathematical sense;

(b) an analysis that starts with a PCA, then selects a reduced number of components that are then rotated and interpreted (sometimes this is called factor analysis and sometimes principal component analysis); and

(c) an analysis using a more complete and formal statistical model that involves fitting parameters describing the influence of a few latent variables or factors, and that also includes other parameters explicitly representing or adjusting for the effects of random (typically measurement) error. Here, random sampling of cases is assumed, with each case providing a vector of presumably correlated observations on multiple variables; the characteristics of the random errors are also specified.

(Note that because of the least-squares properties of truncated PCA, method (b) above can be considered an ordinary-least-squares fitting of a statistical model, but one that does not explicitly provide for the (biasing) effects of error on the variances of the variables.)

Conceptually there are important differences between (a), (b), and (c). However, procedures (b) and (c) usually give similar results and lead to the same interpretation. On the other hand, the results of procedure (a) differ quite strongly from those of (b) and (c), and in any case it is not appropriate to interpret them in terms of latent variables.
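To put the contrast between (b) and (c) in algebraic terms (a standard textbook formulation added here purely for illustration; the notation, with Lambda for the loading matrix, f for the latent factors, and Psi for the diagonal matrix of error variances, is mine and not anything from the original posting):

    % (c) the common factor model: latent factors plus an explicit error term
    x = \Lambda f + e, \qquad \operatorname{Cov}(e) = \Psi \ \text{(diagonal)},
    \qquad \text{so} \qquad \Sigma = \Lambda \Lambda^{\top} + \Psi

    % (b) truncated (and then rotated) PCA: the first k components alone are
    % asked to reproduce \Sigma, with no separate parameters for error variance
    \Sigma \approx \Lambda_k \Lambda_k^{\top}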
By the way, to avoid confusion I mention that there is also a procedure (d), commonly called "confirmatory factor analysis", in which one makes more assumptions and so can compare the fit of several explicitly stated models, and which also has explicit treatment of the effects of error, as in (c). But CFA is quite different in spirit and need not be discussed here (except to note that it is no more or less "statistically valid" than the common-factor exploratory method (c)).

(Re: Dr. Ganesh's issue 2)

The common use of method (b) rather than (c) in the "pure" (also known as "hard") sciences is probably due less to any special appropriateness of the method and more to the relative unfamiliarity of researchers in those areas with Common Factor Analysis. But there is at least one other possible contributing reason: it is easier to ignore the difference between (b) and (c) in many hard-science applications, where there are low levels of error in the data.

Some further explanation:

1a. PRINCIPAL COMPONENTS ANALYSIS

Mathematicians have a clear definition of PCA. To keep the discussion simple I won't give the full definition here, but will simply note that it involves decomposition of a matrix into a sum of (outer products of) vectors called components, and that these are mutually orthogonal and (hence) have the property that the first component (or outer product) accounts for the maximal amount of variance of the matrix being decomposed that could be reproduced by any single component, the second explains the maximal variance orthogonal to the first, and so on. Also, there will be as many components as there are rows (or, equivalently, columns) in the original crossproduct matrix, and the reproduction of the matrix by the sum of these vector products will be exact.

One common use of this mathematical PCA in data analysis is to plot, for example, the first vs. second (unrotated) components to see what interesting patterns are revealed.
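A small numerical sketch may make this concrete (my own illustration, not part of the original posting; the data are hypothetical random numbers and the only library assumed is numpy):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 6))        # hypothetical data: 100 cases, 6 variables
    R = np.corrcoef(X, rowvar=False)         # 6 x 6 correlation matrix to be decomposed

    eigvals, eigvecs = np.linalg.eigh(R)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # put the largest-variance component first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Scale each eigenvector by the square root of its eigenvalue; the sum of the
    # outer products of these scaled vectors is the full principal components
    # decomposition of R.
    loadings = eigvecs * np.sqrt(eigvals)

    print(np.allclose(loadings @ loadings.T, R))        # True: exact reproduction with all 6 components
    print(np.allclose(eigvecs.T @ eigvecs, np.eye(6)))  # True: components are mutually orthogonal
    print(np.round(eigvals / eigvals.sum(), 3))         # decreasing shares of variance explained

With all six components retained nothing is lost; the issues discussed next arise only once some components are discarded and the rest are rotated.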
1b. FACTOR ANALYSIS OR (MODIFIED) PCA

As soon as you select fewer components and "rotate" them, you are going beyond the mathematical definition of PCA and are inventing something different. For example, because of rotation, your components no longer have pairwise orthogonality (in the sense that pairwise inner products are zero), and they no longer successively explain maximal amounts of residual variance (these properties are lost for both "orthogonal" and so-called "oblique" rotation). Also, because fewer components are selected, the summed contributions of the components no longer explain all the variance in the matrix.

Because of these differences, and, even more, because of a different objective of the analysis (explained below), the modified procedure is often referred to as "factor analysis". Nonetheless, it is also common in some quarters (in SPSS, etc.) to use the name "Principal Components Analysis (PCA)" to refer to such a procedure. In this case, "PCA" is often considered a name for a particular _kind_ of factor analysis: the kind where you estimate factors by truncating the component matrix and then "rotating" this reduced set of components.

A different name, such as "factor analysis", is particularly necessary when you truncate and rotate with the intent of giving each rotated axis a scientifically meaningful name, because here a scientific model is usually sought or intended--however dimly. Scientists in some fields (e.g., Chemistry) consistently use "factor analysis" to mean this kind of modified PCA, and it is clear that they intend the factors to have scientific reality. When they are decomposing, for example, a set of many spectral curves derived from many mixtures of a few compounds to find the latent spectra of the compounds that were in the mixtures, it is quite clear that their truncation and rotation are intended to uncover the few rotated factors that are scientifically generalizable, in fact that correspond to (the spectra of) real physical things. (For examples of such factor analysis applications, and many other kinds as well, see, e.g., the book _Factor_Analysis_in_Chemistry_ by Malinowski(?), now out in a 2nd edition--but unfortunately it is not in the building I am currently in, so I can't provide a correct or full citation at this time.)

The act of rotation has important implications. As just noted, rotation of principal components is often inspired by the desire to go beyond mere description of the dataset at hand. Each rotated axis is often interpreted as reflecting an underlying physical or biological process, chemical component, etc. This type of exploratory factor analysis by modified PCA is quite different in spirit from mathematical principal components analysis conceived purely as a data compression and description method. In descriptive PCA, a set of components can be considered a compressed version of a set of variables, where each component is a weighted linear combination of the variables. Factors, in the sense discussed above, are different and should be considered (estimates of hypothetical) latent variables that affect the measured variables. Because of the way they are estimated, they are not generally exact linear combinations of the variables. Thus, the picture that Dr. Ganesh provides is quite appropriate:

> FACTORS ---> DATA (VARIABLES...) ---> PRINCIPAL COMPONENTS

But given this, shouldn't we ask whether such recovery of information on latent variables is any less important in the hard or "pure" sciences than in the "soft" sciences?

_Treatment of error in method (b)._ The selection of only the first few components is often an indirect way of acknowledging the effects of error: by consigning the remaining components to the trash heap one is saying that they reflect random or uninteresting perturbations due to error.

1c. COMMON FACTOR ANALYSIS

Some would say, "but then why stop with modified PCA?" When the scientist's intention is to give such a broader (inferential rather than purely descriptive) scientific meaning to the factors, one might want to construct a true *statistical model* for the crossproduct (or covariance, or correlation) matrix. In such a statistical model, the reduced set of components becomes a set of statistical parameters estimated as approximations of the population profiles of latent factors. However, good statistical models often include terms that explicitly describe or provide for the effects of random error. The truncated-PCA-followed-by-rotation procedure does not do that. Nonetheless, it can be argued that some adjustment for error is clearly needed: in any crossproduct matrix based on fallible data, the self-correlation of the error will cause the diagonal cells of the matrix to be "inflated" (biased upward). This biases (upward) the PCA-based estimates of the loadings, since these inflated diagonal cells are included in the data upon which the PCA components are based.

A more refined method that tries to adjust for that problem is also generically referred to as "factor analysis", but is often distinguished from procedure (b) by the name "Common Factor Analysis". It explicitly provides for the effects of random error by introducing added parameters called uniquenesses (or, equivalently, their complements, one minus each uniqueness, which are called communalities). The diagonal elements of the crossproduct (or covariance, or correlation) matrix are replaced by these added model parameters, which are estimates of the sizes the diagonals would have had, had there been no error.
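The difference between (b) and (c) can also be shown in code. The following is a rough sketch of my own, not from the original posting: the two-factor data are simulated, the small varimax routine is one common way to do an orthogonal rotation, and scikit-learn's FactorAnalysis is used as a readily available common factor implementation (it estimates uniquenesses by maximum likelihood rather than by the replace-the-diagonal-with-communalities route described above).

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    def varimax(L, tol=1e-8, max_iter=100):
        # Kaiser's varimax criterion via the standard SVD-based algorithm
        p, k = L.shape
        T = np.eye(k)
        score = 0.0
        for _ in range(max_iter):
            Lr = L @ T
            U, s, Vt = np.linalg.svd(
                L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p))
            T = U @ Vt
            if s.sum() - score < tol:
                break
            score = s.sum()
        return L @ T

    # Hypothetical data generated from a known 2-factor structure plus error
    rng = np.random.default_rng(1)
    true_load = np.array([[.8, 0], [.7, 0], [.6, 0],
                          [0, .8], [0, .7], [0, .6]])
    X = rng.standard_normal((500, 2)) @ true_load.T + 0.5 * rng.standard_normal((500, 6))
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    R = np.corrcoef(X, rowvar=False)

    # (b) truncate the PCA solution to 2 components, then rotate
    eigvals, eigvecs = np.linalg.eigh(R)
    idx = np.argsort(eigvals)[::-1][:2]
    pca_loadings = varimax(eigvecs[:, idx] * np.sqrt(eigvals[idx]))

    # (c) common factor model: uniquenesses are estimated explicitly
    fa = FactorAnalysis(n_components=2).fit(X)
    fa_loadings = varimax(fa.components_.T)

    print("Truncated+rotated PCA loadings:\n", np.round(pca_loadings, 2))
    print("Common-factor loadings:\n", np.round(fa_loadings, 2))
    print("Estimated uniquenesses:\n", np.round(fa.noise_variance_, 2))

With noisy data of this kind the truncated-PCA loadings tend to run somewhat higher than the common-factor loadings, which is the diagonal-inflation bias described above.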
Three other brief comments:

2. Classical multiple regression also fails to take into account the effects of measurement error on (predictor) variables. Although interesting work has been going on to change this, I think it is fair to say that we have been able to do a lot of good statistical work with the classical regression approach.

3. Although factor method (b) (factor analysis without error adjustments) seems clearly biased, it is sometimes claimed that it may have other benefits. It has been argued that (because of fewer parameters, and/or perhaps for other reasons) the "Principal Components Analysis" method of obtaining factors is more robust than Common Factor Analysis in cases where some factor(s) are determined by only a few, and in the worst case by only two, variables with substantial loadings.

4. Issues involved in factor rotation, such as differences among factor rotation criteria, are probably far more important than the provision for error, or lack thereof, in (b) or (c). The effect of rotation differences on the final interpretation can often be much greater. At the same time, the rationale and theoretical discussion of rotation methods, pro and con, is probably much less frequent, and the issues much more often misunderstood--or simply disregarded--in current scientific usage of exploratory factor analysis in many fields. However, some researchers in the hard sciences are more acutely aware of the problem and have done more to try to overcome it (see, for example, the above-mentioned book on FA in chemistry).

Richard A. Harshman, Psychology Dept., University of Western Ontario, London, Ontario, Canada.
(lab) 519-661-3663, (office) 519-661-2111 x4675, fax 519-661-3213.