TY - JOUR

T1 - Seeing distinct groups where there are none

T2 - spurious patterns from between-group PCA

AU - Cardini, A.

AU - O'Higgins, Paul

AU - Rohlf, F. J.

N1 - © Springer Science+Business Media, LLC, part of Springer Nature 2019. This is an author-produced version of the published paper. Uploaded in accordance with the publisher’s self-archiving policy. Further copying may not be permitted; contact the publisher for details.

PY - 2019/10/18

Y1 - 2019/10/18

N2 - Using sampling experiments,
we found that, when there are fewer groups than variables, between-groups PCA (bgPCA)
may suggest surprisingly distinct differences among groups for data in which
none exist. While apparently not noticed before, the reasons for this problem
are easy to understand. A bgPCA captures the g-1 dimensions of variation
among the g group means, but only a fraction of the ∑n_i - g
dimensions of within-group variation (where the n_i
are the sample sizes), when the number of
variables, p, is greater than g-1. This introduces a distortion in
the appearance of the bgPCA plots because the within-group variation will be
underrepresented, unless the variables are sufficiently correlated so that the
total variation can be accounted for with just g-1 dimensions. The
effect is most obvious when sample sizes are small relative to the number of
variables, because smaller samples spread out less, but the distortion is
present even for large samples. Strong covariance among variables largely
reduces the magnitude of the problem, because it effectively reduces the
dimensionality of the data and thus enables a larger proportion of the
within-group variation to be accounted for within the g-1-dimensional
space of a bgPCA. The distortion will still be relevant, though its strength
will vary from case to case depending on the structure of the data (p, g,
covariances, etc.). These are important problems for a method mainly designed
for the analysis of variation among groups when there are very large numbers of
variables and relatively small samples. In such cases, users are likely to
conclude that the groups they are comparing are much more distinct than they
really are. Having many variables but
only small sample sizes is a common problem in fields ranging from
morphometrics (as in our examples) to molecular analyses.

AB - Using sampling experiments,
we found that, when there are fewer groups than variables, between-groups PCA (bgPCA)
may suggest surprisingly distinct differences among groups for data in which
none exist. While apparently not noticed before, the reasons for this problem
are easy to understand. A bgPCA captures the g-1 dimensions of variation
among the g group means, but only a fraction of the ∑n_i - g
dimensions of within-group variation (where the n_i
are the sample sizes), when the number of
variables, p, is greater than g-1. This introduces a distortion in
the appearance of the bgPCA plots because the within-group variation will be
underrepresented, unless the variables are sufficiently correlated so that the
total variation can be accounted for with just g-1 dimensions. The
effect is most obvious when sample sizes are small relative to the number of
variables, because smaller samples spread out less, but the distortion is
present even for large samples. Strong covariance among variables largely
reduces the magnitude of the problem, because it effectively reduces the
dimensionality of the data and thus enables a larger proportion of the
within-group variation to be accounted for within the g-1-dimensional
space of a bgPCA. The distortion will still be relevant, though its strength
will vary from case to case depending on the structure of the data (p, g,
covariances, etc.). These are important problems for a method mainly designed
for the analysis of variation among groups when there are very large numbers of
variables and relatively small samples. In such cases, users are likely to
conclude that the groups they are comparing are much more distinct than they
really are. Having many variables but
only small sample sizes is a common problem in fields ranging from
morphometrics (as in our examples) to molecular analyses.

U2 - 10.1007/s11692-019-09487-5

DO - 10.1007/s11692-019-09487-5

M3 - Article

JO - Evolutionary Biology

JF - Evolutionary Biology

SN - 0071-3260

ER -