A cautionary note on using internal cross validation to select the number of clusters |
| |
Authors: | Abba M. Krieger, Paul E. Green |
| |
Affiliation: | (1) Department of Statistics, University of Pennsylvania, USA; (2) Marketing Department, The Wharton School, University of Pennsylvania, 1400 Steinberg Hall-Dietrich Hall, Philadelphia, PA 19104-6371 |
| |
Abstract: | A highly popular method for examining the stability of a data clustering is to split the data into two parts, cluster the observations in Part A, assign the objects in Part B to their nearest centroid in Part A, and then independently cluster the Part B objects. One then examines how closely the two partitions of Part B agree (say, by the Rand measure). Another proposal is to split the data into k parts and see how their centroids cluster. By means of synthetic data analyses, we demonstrate that these approaches fail to identify the appropriate number of clusters, particularly as sample size becomes large and the variables exhibit higher correlations. The authors express their thanks to the Sol C. Snider Entrepreneurial Center, Wharton School, for support of this project. |
| |
Keywords: | cluster analysis; cross-validation; stopping rules |
This article is indexed in SpringerLink and other databases. |
|
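The split-half procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name a clustering algorithm, so plain Lloyd's k-means is assumed here, and every function name (`kmeans`, `rand_index`, `split_half_rand`) is hypothetical. Points are assumed to be tuples of numbers.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points (tuples of numbers)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's k-means (an assumed choice); returns (centroids, labels)."""
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


def rand_index(a, b):
    """Share of object pairs on which two labelings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def split_half_rand(points, k, seed=0):
    """Split-half stability check for a candidate number of clusters k."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    part_a, part_b = shuffled[:half], shuffled[half:]
    centroids_a, _ = kmeans(part_a, k, rng)                 # cluster Part A
    assigned = [nearest(p, centroids_a) for p in part_b]    # Part B -> nearest Part A centroid
    _, clustered = kmeans(part_b, k, rng)                   # cluster Part B independently
    return rand_index(assigned, clustered)                  # agreement of the two Part B partitions
```

A stopping rule of the kind the abstract cautions against would evaluate `split_half_rand(points, k)` over a range of k and pick the k with the highest agreement. The abstract reports that this tends to fail, particularly with large samples and correlated variables, so the sketch shows the mechanics only and is not a recommended selection method.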