Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
Steinley D. Psychological Methods, 2006, 11(2): 178-192
Using the cluster generation procedure proposed by D. Steinley and R. Henson (2005), the author investigated the performance of K-means clustering under the following scenarios: (a) different probabilities of cluster overlap; (b) different types of cluster overlap; (c) varying sample sizes, clusters, and dimensions; (d) different multivariate distributions of clusters; and (e) various multidimensional data structures. The results are evaluated in terms of the Hubert-Arabie adjusted Rand index, and several observations concerning the performance of K-means clustering are made. Finally, the article concludes with the proposal of a diagnostic technique indicating when the partitioning given by a K-means cluster analysis can be trusted. By combining the information from several observable characteristics of the data (number of clusters, number of variables, sample size, etc.) with the prevalence of unique local optima in several thousand implementations of the K-means algorithm, the author provides a method capable of guiding key data-analysis decisions.
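The Hubert-Arabie adjusted Rand index named in this abstract compares two partitions of the same objects and corrects the raw pair-agreement count for chance. A minimal sketch of the standard contingency-table formula follows; the function name and the handling of degenerate partitions are assumptions made for illustration, not taken from the article.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index between two partitions (illustrative sketch)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two objects are required")

    # Contingency-table cells and margins, stored sparsely as counters.
    cell_counts = Counter(zip(labels_a, labels_b))
    row_counts = Counter(labels_a)
    col_counts = Counter(labels_b)

    # Pairs of objects placed together in both partitions, in A, and in B.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())

    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)       # largest attainable agreement
    if max_index == expected:
        # Assumption: two trivial partitions (all-in-one or all singletons) count as identical.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The index equals 1 when the two partitions agree up to a relabeling of clusters and is close to 0 when agreement is at chance level; it can be negative when agreement is worse than chance.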

2.
A Monte Carlo evaluation of thirty internal criterion measures for cluster analysis was conducted. Artificial data sets were constructed with clusters which exhibited the properties of internal cohesion and external isolation. The data sets were analyzed by four hierarchical clustering methods. The resulting values of the internal criteria were compared with two external criterion indices which determined the degree of recovery of correct cluster structure by the algorithms. The results indicated that a subset of internal criterion measures could be identified which appear to be valid indices of correct cluster recovery. Indices from this subset could form the basis of a permutation test for the existence of cluster structure or a clustering algorithm.

3.
A variable-selection heuristic for K-means clustering   Cited by: 4 (self-citations: 0; citations by others: 4)
One of the most vexing problems in cluster analysis is the selection and/or weighting of variables in order to include those that truly define cluster structure, while eliminating those that might mask such structure. This paper presents a variable-selection heuristic for nonhierarchical (K-means) cluster analysis based on the adjusted Rand index for measuring cluster recovery. The heuristic was subjected to Monte Carlo testing across more than 2200 datasets with known cluster structure. The results indicate the heuristic is extremely effective at eliminating masking variables. A cluster analysis of real-world financial services data revealed that using the variable-selection heuristic prior to the K-means algorithm resulted in greater cluster stability.

4.
Although the K-means algorithm for minimizing the within-cluster sums of squared deviations from cluster centroids is perhaps the most common method for applied cluster analyses, a variety of other criteria are available. The p-median model is an especially well-studied clustering problem that requires the selection of p objects to serve as cluster centers. The objective is to choose the cluster centers such that the sum of the Euclidean distances (or some other dissimilarity measure) of objects assigned to each center is minimized. Using 12 data sets from the literature, we demonstrate that a three-stage procedure consisting of a greedy heuristic, Lagrangian relaxation, and a branch-and-bound algorithm can produce globally optimal solutions for p-median problems of nontrivial size (several hundred objects, five or more variables, and up to 10 clusters). We also report the results of an application of the p-median model to an empirical data set from the telecommunications industry.

5.
Perhaps the most common criterion for partitioning a data set is the minimization of the within-cluster sums of squared deviations from cluster centroids. Although optimal solution procedures for within-cluster sums of squares (WCSS) partitioning are computationally feasible for small data sets, heuristic procedures are required for most practical applications in the behavioral sciences. We compared the performances of nine prominent heuristic procedures for WCSS partitioning across 324 simulated data sets representative of a broad spectrum of test conditions. Performance comparisons focused on both percentage deviation from the “best-found” WCSS values and recovery of true cluster structure. A real-coded genetic algorithm and variable neighborhood search heuristic were the most effective methods; however, a straightforward two-stage heuristic algorithm, HK-means, also yielded exceptional performance. A follow-up experiment using 13 empirical data sets from the clustering literature generally supported the results of the experiment using simulated data. Our findings have important implications for behavioral science researchers, whose theoretical conclusions could be adversely affected by poor algorithmic performances.

6.
The popular K-means clustering method, as implemented in 3 commercial software packages (SPSS, SYSTAT, and SAS), generally provides solutions that are only locally optimal for a given set of data. Because none of these commercial implementations offer a reasonable mechanism to begin the K-means method at alternative starting points, separate routines were written within the MATLAB (MathWorks, 1999) environment that can be initialized randomly (these routines are provided at the end of the online version of this article in the PsycARTICLES database). Through the analysis of 2 empirical data sets and 810 simulated data sets, it is shown that the results provided by commercial packages are most likely locally optimal. These results suggest the need for some strategy to study the local optima problem for a specific data set or to identify methods for finding "good" starting values that might lead to the best solutions possible.
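The local-optima problem described in this abstract can be seen by running K-means from many random starting points and keeping the solution with the smallest within-cluster sum of squares. The sketch below is a plain Lloyd-style K-means in standard-library Python, not the MATLAB routines from the article; the function names, the choice of starting centroids (k randomly sampled data points), and the default number of restarts are all assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run from k randomly chosen data points; returns (labels, wcss)."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so a local optimum has been reached
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, wcss


def kmeans_restarts(points, k, n_starts=50, seed=0):
    """Run K-means n_starts times; return best labels, best WCSS, and number of distinct WCSS values."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_starts)]
    best_labels, best_wcss = min(runs, key=lambda run: run[1])
    n_distinct = len({round(w, 8) for _, w in runs})
    return best_labels, best_wcss, n_distinct


# Four points forming two tight pairs; starting from both points of one pair
# leads to a poor local optimum that splits the data the wrong way.
demo_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
demo_labels, demo_wcss, demo_distinct = kmeans_restarts(demo_points, k=2)
```

When the number of distinct WCSS values across restarts is greater than one, a single run cannot be assumed to have found the best partition, which is the concern the abstract raises about single-start implementations.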

7.
Distance association models constitute a useful tool for the analysis and graphical representation of cross-classified data in which distances between points inversely describe the association between two categorical variables. When the number of cells is large and the data counts result in sparse tables, the combination of clustering and representation reduces the number of parameters to be estimated and facilitates interpretation. In this article, a latent block distance-association model is proposed to apply block clustering to the outcomes of two categorical variables while the cluster centers are represented in a low dimensional space in terms of a distance-association model. This model is particularly useful for contingency tables in which both the rows and the columns are characterized as profiles of sets of response variables. The parameters are estimated under a Poisson sampling scheme using a generalized EM algorithm. The performance of the model is tested in a Monte Carlo experiment, and an empirical data set is analyzed to illustrate the model.

8.
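Guo Lei, Yang Jing, Song Naiqing (郭磊, 杨静, 宋乃庆). Psychological Science (《心理科学》), 2018, (3): 735-742
Cluster analysis has been applied successfully in cognitive diagnostic assessment (CDA). The most widely used clustering method is the K-means algorithm, and previous studies have shown that K-means clusters well in CDA. Spectral clustering, however, usually classifies better than K-means. This study introduces spectral clustering into CDA and examines how attribute hierarchical structure, number of attributes, sample size, and slip rate affect the method. The findings are: (1) spectral clustering yields better clustering results than K-means and is more robust, especially under harsh experimental conditions; (2) clustering performance is best for the linear structure, similar for the convergent and divergent structures, and poorer for the independent structure; (3) clustering performance declines as the number of attributes and the slip rate increase; (4) clustering performance improves somewhat as sample size increases, although K-means sometimes shows the opposite result.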

9.
A split-sample replication criterion originally proposed by J. E. Overall and K. N. Magee (1992) as a stopping rule for hierarchical cluster analysis is applied to multiple data sets generated by sampling with replacement from an original simulated primary data set. An investigation of the validity of this bootstrap procedure was undertaken using different combinations of the true number of latent populations, degrees of overlap, and sample sizes. The bootstrap procedure enhanced the accuracy of identifying the true number of latent populations under virtually all conditions. Increasing the size of the resampled data sets relative to the size of the primary data set further increased accuracy. A computer program to implement the bootstrap stopping rule is made available via a referenced Web site.

10.
McLachlan GJ. Psychological Methods, 2011, 16(1): 80-1; discussion 89-92
I discuss the recommendations and cautions in Steinley and Brusco's (2011) article on the use of finite mixture models to cluster a data set. In their article, much use is made of comparison with the K-means procedure. As noted by researchers for over 30 years, the K-means procedure can be viewed as a special case of finite mixture modeling in which the components are in equal (fixed) proportions and are taken to be normal with a common spherical covariance matrix. In this commentary, I pay particular attention to this link and to the use of normal mixture models with arbitrary component-covariance matrices.
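The link between K-means and mixture modeling stated in this abstract can be made concrete with a short derivation. The sketch below assumes the classification (hard-assignment) form of the likelihood, which the abstract does not spell out; the symbols $z_i$ (cluster label of object $i$), $\mu_k$ (component means), $\sigma^2$ (common variance), $K$ (number of components), $n$ (number of objects), and $p$ (number of variables) are introduced here for illustration.

```latex
% Classification log-likelihood with equal proportions 1/K and
% common spherical covariance \sigma^2 I_p (assumed setup):
\ell_C(z, \mu)
  = \sum_{i=1}^{n} \log\!\left[ \frac{1}{K}\,
      \phi\!\left(x_i;\, \mu_{z_i},\, \sigma^2 I_p\right) \right]
  = -\,n \log K \;-\; \frac{np}{2}\log\!\left(2\pi\sigma^2\right)
    \;-\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} \lVert x_i - \mu_{z_i} \rVert^2 .
```

For any fixed $\sigma^2$, only the last term depends on the labels and the means, so maximizing $\ell_C$ over $(z, \mu)$ is the same as minimizing the within-cluster sum of squares $\sum_i \lVert x_i - \mu_{z_i} \rVert^2$, which is the K-means criterion.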

11.
This paper proposes an order-constrained K-means cluster analysis strategy, and implements that strategy through an auxiliary quadratic assignment optimization heuristic that identifies an initial object order. A subsequent dynamic programming recursion is applied to optimally subdivide the object set subject to the order constraint. We show that although the usual K-means sum-of-squared-error criterion is not guaranteed to be minimal, a true underlying cluster structure may be more accurately recovered. Also, substantive interpretability seems generally improved when constrained solutions are considered. We illustrate the procedure with several data sets from the literature.

12.
A measure of “clusterability” serves as the basis of a new methodology designed to preserve cluster structure in a reduced dimensional space. Similar to principal component analysis, which finds the direction of maximal variance in multivariate space, principal cluster axes find the direction of maximum clusterability in multivariate space. Furthermore, the principal clustering approach falls into the class of projection pursuit techniques. Comparisons are made with existing methodologies both in a simulation study and analysis of real-world data sets. Furthermore, a demonstration of how to interpret the results of the principal cluster axes is provided on the analysis of Supreme Court voting data and similarities between the interpretation of competing procedures (e.g., factor analysis and principal component analysis) are provided. In addition to the Supreme Court analysis, we analyze several data sets often used to test cluster analysis procedures, including Fisher's Iris data, Agresti's Crab data, and a data set on glass fragments. Finally, discussion is provided to help determine when the proposed procedure will be the most beneficial to the researcher.

13.
Minimization of the within-cluster sums of squares (WCSS) is one of the most important optimization criteria in cluster analysis. Although cluster analysis modules in commercial software packages typically use heuristic methods for this criterion, optimal approaches can be computationally feasible for problems of modest size. This paper presents a new branch-and-bound algorithm for minimizing WCSS. Algorithmic enhancements include an effective reordering of objects and a repetitive solution approach that precludes the need for splitting the data set, while maintaining strong bounds throughout the solution process. The new algorithm provided optimal solutions for problems with up to 240 objects and eight well-separated clusters. Poorly separated problems with no inherent cluster structure were optimally solved for up to 60 objects and six clusters. The repetitive branch-and-bound algorithm was also successfully applied to three empirical data sets from the classification literature.

14.
A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets which contained either 2, 3, 4, or 5 distinct nonoverlapping clusters. To provide a variety of clustering solutions, the data sets were analyzed by four hierarchical clustering methods. External criterion measures indicated excellent recovery of the true cluster structure by the methods at the correct hierarchy level. Thus, the clustering present in the data was quite strong. The simulation results for the stopping rules revealed a wide range in their ability to determine the correct number of clusters in the data. Several procedures worked fairly well, whereas others performed rather poorly. Thus, the latter group of rules would appear to have little validity, particularly for data sets containing distinct clusters. Applied researchers are urged to select one or more of the better criteria. However, users are cautioned that the performance of some of the criteria may be data dependent. The authors would like to express their appreciation to a number of individuals who provided assistance during the conduct of this research. Those who deserve recognition include Roger Blashfield, John Crawford, John Gower, James Lingoes, Wansoo Rhee, F. James Rohlf, Warren Sarle, and Tom Soon.

15.
In the application of clustering methods to real world data sets, two problems frequently arise: (a) how can the various contributory variables in a specific battery be weighted so as to enhance some cluster structure that may be present, and (b) how can various alternative batteries be combined to produce a single clustering that best incorporates each contributory set. A new method is proposed (SYNCLUS, SYNthesized CLUStering) for dealing with these two problems. We wish to thank Anne Freeny and Deborah Art for their computer assistance, and Ed Fowlkes for his helpful technical discussion. We would also like to acknowledge the insightful and helpful comments from the editor and reviewers.

16.
This article provides a large-scale investigation into several of the properties of mixture-model clustering techniques (also referred to as latent class cluster analysis, latent profile analysis, model-based clustering, probabilistic clustering, Bayesian classification, unsupervised learning, and finite mixture models; see Vermunt & Magidson, 2002). Focus is given to the multivariate normal distribution, and 9 separate decompositions (i.e., class structures) of the covariance matrix are investigated. To provide a link to the current literature, comparisons are made with K-means clustering in 3 detailed Monte Carlo studies. The findings have implications for applied researchers in that mixture-model clustering techniques performed best when the covariance structure and number of clusters were known. However, as the information about the shape and number of clusters became unknown, degraded performance was observed for both K-means clustering and mixture-model clustering.

17.
This paper develops a new procedure, called stability analysis, for K-means clustering. Instead of ignoring local optima and only considering the best solution found, this procedure takes advantage of additional information from a K-means cluster analysis. The information from the locally optimal solutions is collected in an object by object co-occurrence matrix. The co-occurrence matrix is clustered and subsequently reordered by a steepest ascent quadratic assignment procedure to aid visual interpretation of the multidimensional cluster structure. Subsequently, measures are developed to determine the overall structure of a data set, the number of clusters and the multidimensional relationships between the clusters.
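The object by object co-occurrence matrix mentioned in this abstract can be illustrated with a few lines of code. The sketch below builds such a matrix from a list of label vectors, one per locally optimal solution. Storing the proportion of solutions in which two objects share a cluster (rather than a raw count) is an assumption made here, and the function name is hypothetical; the later clustering and quadratic assignment reordering steps are not shown.

```python
def cooccurrence_matrix(partitions):
    """Proportion of partitions in which each pair of objects shares a cluster (illustrative sketch)."""
    if not partitions:
        raise ValueError("at least one partition is required")
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("all partitions must label the same objects")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    # Assumption: normalize counts to proportions so entries lie in [0, 1].
    return [[c / m for c in row] for row in counts]


# Two hypothetical locally optimal solutions for three objects.
demo_matrix = cooccurrence_matrix([[0, 0, 1], [0, 1, 1]])
```

Entries near 1 mark pairs of objects that are grouped together in almost every solution, while intermediate values mark objects whose cluster membership changes from one local optimum to another.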

18.
Steinley (2007) provided a lower bound for the sum-of-squares error criterion function used in K-means clustering. In this article, on the basis of the lower bound, the authors propose a method to distinguish between 1 cluster (i.e., a single distribution) versus more than 1 cluster. Additionally, conditional on indicating there are multiple clusters, the procedure is extended to determine the number of clusters. Through a series of simulations, the proposed methodology is shown to outperform several other commonly used procedures for determining both the presence of clusters and their number.

19.
The work in this paper introduces finite mixture models that can be used to simultaneously cluster the rows and columns of two-mode ordinal categorical response data, such as those resulting from Likert scale responses. We use the popular proportional odds parameterisation and propose models which provide insights into major patterns in the data. Model-fitting is performed using the EM algorithm, and a fuzzy allocation of rows and columns to corresponding clusters is obtained. The clustering ability of the models is evaluated in a simulation study and demonstrated using two real data sets.

20.
The clustering of two-mode proximity matrices is a challenging combinatorial optimization problem that has important applications in the quantitative social sciences. We focus on one particular type of problem related to the clustering of a two-mode binary matrix, which is relevant to the establishment of generalized blockmodels for social networks. In this context, clusters for the rows of the two-mode matrix intersect with clusters of the columns to form blocks, which should ideally be either complete (all 1s) or null (all 0s). A new procedure based on variable neighborhood search is presented and compared to an existing two-mode K-means clustering algorithm. The new procedure generally provided slightly greater explained variation; however, both methods yielded exceptional recovery of cluster structure.
