Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets which contained either 2, 3, 4, or 5 distinct nonoverlapping clusters. To provide a variety of clustering solutions, the data sets were analyzed by four hierarchical clustering methods. External criterion measures indicated excellent recovery of the true cluster structure by the methods at the correct hierarchy level. Thus, the clustering present in the data was quite strong. The simulation results for the stopping rules revealed a wide range in their ability to determine the correct number of clusters in the data. Several procedures worked fairly well, whereas others performed rather poorly. Thus, the latter group of rules would appear to have little validity, particularly for data sets containing distinct clusters. Applied researchers are urged to select one or more of the better criteria. However, users are cautioned that the performance of some of the criteria may be data dependent. The authors would like to express their appreciation to a number of individuals who provided assistance during the conduct of this research. Those who deserve recognition include Roger Blashfield, John Crawford, John Gower, James Lingoes, Wansoo Rhee, F. James Rohlf, Warren Sarle, and Tom Soon.

2.
Perhaps the most common criterion for partitioning a data set is the minimization of the within-cluster sums of squared deviation from cluster centroids. Although optimal solution procedures for within-cluster sums of squares (WCSS) partitioning are computationally feasible for small data sets, heuristic procedures are required for most practical applications in the behavioral sciences. We compared the performances of nine prominent heuristic procedures for WCSS partitioning across 324 simulated data sets representative of a broad spectrum of test conditions. Performance comparisons focused on both percentage deviation from the “best-found” WCSS values, as well as recovery of true cluster structure. A real-coded genetic algorithm and variable neighborhood search heuristic were the most effective methods; however, a straightforward two-stage heuristic algorithm, HK-means, also yielded exceptional performance. A follow-up experiment using 13 empirical data sets from the clustering literature generally supported the results of the experiment using simulated data. Our findings have important implications for behavioral science researchers, whose theoretical conclusions could be adversely affected by poor algorithmic performances.
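To make the WCSS criterion in this entry concrete, here is a minimal Python sketch that evaluates it for a given partition. The function name `wcss` and the toy data are illustrative assumptions, not taken from the article, and the sketch only computes the criterion; it is not one of the nine heuristics that were compared.

```python
from collections import defaultdict

def wcss(points, labels):
    """Sum of squared Euclidean deviations of each point from its cluster centroid."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    total = 0.0
    for members in clusters.values():
        dims = len(members[0])
        # Centroid = coordinate-wise mean of the cluster members.
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))
    return total

# Hypothetical toy data: two tight pairs of points far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(wcss(points, [0, 0, 1, 1]))  # partition that matches the two pairs
print(wcss(points, [0, 1, 0, 1]))  # partition that cuts across the pairs
```

A partitioning heuristic would search over label vectors for the smallest such value; the second partition above scores far worse than the first.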

3.
Milligan, Glenn W. Psychometrika, 1980, 45(3): 325-342
An evaluation of several clustering methods was conducted. Artificial clusters which exhibited the properties of internal cohesion and external isolation were constructed. The true cluster structure was subsequently hidden by six types of error-perturbation. The results indicated that the hierarchical methods were differentially sensitive to the type of error perturbation. In addition, generally poor recovery performance was obtained when random seed points were used to start the K-means algorithms. However, two alternative starting procedures for the nonhierarchical methods produced greatly enhanced cluster recovery and were found to be robust with respect to all of the types of error examined.

4.
A Monte Carlo evaluation of four procedures for detecting taxonicity was conducted using artificial data sets that were either taxonic or nontaxonic. The data sets were analyzed using two of Meehl's taxometric procedures, MAXCOV and MAMBAC, Ward's method for cluster analysis in concert with the cubic clustering criterion and a latent variable mixture modeling technique. Performance of the taxometric procedures and latent variable mixture modeling were clearly superior to that of cluster analysis in detecting taxonicity. Applied researchers are urged to select from the better procedures and to perform consistency tests.

5.
Recent cluster analytic research with alcoholic inpatients has demonstrated the existence of several Millon Clinical Multiaxial Inventory (MCMI) clusters that appear to be consistent across different subject samples. The validity of these data would be strengthened by a statistical demonstration of the similarity of attained clusters across studies, a demonstration of concordance of subject classification across different clustering techniques on the same data set, and the inclusion of external, independent measures against which to evaluate the predictive validity of the cluster typology. We found a high level of concordance in subject classification across different clustering methods on the same data set and a high level of agreement with cluster typologies attained in previous studies. Subsequent multivariate analyses employing independent scales measuring various aspects of alcohol use confirmed differences among cluster members on perceived benefits of alcohol use and deleterious effects of alcohol use. The prominent differences in alcohol use along with a rationale for their development are discussed.

6.
The emergence of Gaussian model-based partitioning as a viable alternative to K-means clustering fosters a need for discrete optimization methods that can be efficiently implemented using model-based criteria. A variety of alternative partitioning criteria have been proposed for more general data conditions that permit elliptical clusters, different spatial orientations for the clusters, and unequal cluster sizes. Unfortunately, many of these partitioning criteria are computationally demanding, which makes the multiple-restart (multistart) approach commonly used for K-means partitioning less effective as a heuristic solution strategy. As an alternative, we propose an approach based on iterated local search (ILS), which has proved effective in previous combinatorial data analysis contexts. We compared multistart, ILS and hybrid multistart–ILS procedures for minimizing a very general model-based criterion that assumes no restrictions on cluster size or within-group covariance structure. This comparison, which used 23 data sets from the classification literature, revealed that the ILS and hybrid heuristics generally provided better criterion function values than the multistart approach when all three methods were constrained to the same 10-min time limit. In many instances, these differences in criterion function values reflected profound differences in the partitions obtained.

7.
Brusco MJ. Psychological Methods, 2004, 9(4): 510-523
A number of important applications require the clustering of binary data sets. Traditional nonhierarchical cluster analysis techniques, such as the popular K-means algorithm, can often be successfully applied to these data sets. However, the presence of masking variables in a data set can impede the ability of the K-means algorithm to recover the true cluster structure. The author presents a heuristic procedure that selects an appropriate subset from among the set of all candidate clustering variables. Specifically, this procedure attempts to select only those variables that contribute to the definition of true cluster structure while eliminating variables that can hide (or mask) that true structure. Experimental testing of the proposed variable-selection procedure reveals that it is extremely successful at accomplishing this goal.

8.
Eight humans participated in a two-choice signal-detection task in which stimulus disparity was varied over four levels. Two procedures arranged asymmetrical numbers of reinforcers received for correct left- and right-key responses (the reinforcer ratio). The controlled procedure ensured that the obtained reinforcer ratio remained constant over changes in stimulus disparity, irrespective of subjects' performances. In the uncontrolled procedure, the asymmetrical reinforcer ratio could covary with subjects' performances. The receiver operating characteristic (ROC) patterns obtained from the controlled procedure approximated isobias functions predicted by criterion location measures of bias. The uncontrolled procedure produced variable ROC patterns that were somewhat like the isobias predictions made by likelihood ratio measures of bias; however, the obtained reinforcer ratio became more extreme as discriminability decreased. The obtained pattern of bias was directly related to the obtained reinforcer ratio. This research indicates that criterion location measures seem to be preferable indices of response bias.
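The two families of bias measure contrasted in this entry can be illustrated with the textbook equal-variance Gaussian signal-detection formulas. The sketch below is an assumption about which formulas are meant (the abstract does not state them): discriminability d' = z(H) - z(F), criterion location c = -(z(H) + z(F))/2, and likelihood ratio beta = exp(d' * c). The function name `sdt_measures` and the example rates are hypothetical.

```python
from math import exp
from statistics import NormalDist

def sdt_measures(hit_rate, false_alarm_rate):
    """Equal-variance Gaussian signal-detection indices.

    Both rates must lie strictly between 0 and 1.
    Returns (d_prime, criterion_location, likelihood_ratio_beta).
    """
    z = NormalDist().inv_cdf          # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    d_prime = z_hit - z_fa            # discriminability
    criterion = -0.5 * (z_hit + z_fa)  # criterion location c (0 = unbiased)
    beta = exp(d_prime * criterion)   # likelihood ratio at the criterion
    return d_prime, criterion, beta

# Hypothetical rates: an unbiased observer, then one biased toward "signal".
print(sdt_measures(0.8, 0.2))
print(sdt_measures(0.9, 0.3))
```

Because ln(beta) equals d' times c under these assumptions, a fixed c implies a beta that moves with discriminability, and vice versa, which is why the two measures predict different isobias curves.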

9.
The Youth Psychopathic Traits Inventory-Short Version (YPI-S; van Baardewijk et al., 2010) is a self-report measure to assess psychopathic-like traits in adolescents. The aim of the present study is to investigate the factor structure, the internal consistency, and the criterion validity of the YPI-S in 768 Belgian community adolescents (45.4% males). In general, our study supported the YPI three factor structure while relevant indices showed that the instrument is internally consistent. In addition, relations between the YPI-S total score and dimension scores on the one hand and external criterion measures (e.g. conduct problems and self-reported offending) on the other hand were generally in line with predictions. The present study replicated and substantially extended previous findings of the YPI-S in a sample of community youth. Future studies are needed to test whether findings from community samples can be replicated in clinical-referred and justice-involved boys and adolescents.

10.
Additive clustering provides a conceptually simple and potentially powerful approach to modeling the similarity relationships between stimuli. The ability of additive clustering models to accommodate similarity data, however, typically arises through the incorporation of large numbers of parameterized clusters. Accordingly, for the purposes of both model generation and model comparison, it is necessary to develop quantitative evaluative measures of additive clustering models that take into account both data-fit and complexity. Using a previously developed probabilistic formulation of additive clustering, the Bayesian Information Criterion is proposed for this role, and its application demonstrated. Limitations inherent in this approach, including the assumption that model complexity is equivalent to cluster cardinality, are discussed. These limitations are addressed by applying the Laplacian approximation of a marginal probability density, from which a measure of cluster structure complexity is derived. Using this measure, a preliminary investigation is made of the various properties of cluster structures that affect additive clustering model complexity. Among other things, these investigations show that, for a fixed number of clusters, a model with a strictly nested cluster structure is the least complicated, while a model with a partitioning cluster structure is the most complicated. Copyright 2001 Academic Press.

11.
A split-sample replication stopping rule for hierarchical cluster analysis is compared with the internal criterion previously found superior by Milligan and Cooper (1985) in their comparison of 30 different procedures. The number and extent of overlap of the latent population distributions was systematically varied in the present evaluation of stopping-rule validity. Equal and unequal population base rates were also considered. Both stopping rules correctly identified the actual number of populations when there was essentially no overlap and clusters occupied visually distinct regions of the measurement space. The replication criterion, which is evaluated by clustering of cluster means from preliminary analyses that are accomplished on random partitions of an original data set, was superior as the degree of overlap in population distributions increased. Neither method performed adequately when overlap obliterated visually discernible density nodes. This research was supported in part by NIMH grant 5R01 MH 32457 14.

12.
Preference data, such as Likert scale data, are often obtained in questionnaire-based surveys. Clustering respondents based on survey items is useful for discovering latent structures. However, cluster analysis of preference data may be affected by response styles, that is, a respondent's systematic response tendencies irrespective of the item content. For example, some respondents may tend to select ratings at the ends of the scale, which is called an ‘extreme response style’. A cluster of respondents with an extreme response style can be mistakenly identified as a content-based cluster. To address this problem, we propose a novel method of clustering respondents based on their indicated preferences for a set of items while correcting for response-style bias. We first introduce a new framework to detect, and correct for, response styles by generalizing the definition of response styles used in constrained dual scaling. We then simultaneously correct for response styles and perform a cluster analysis based on the corrected preference data. A simulation study shows that the proposed method yields better clustering accuracy than the existing methods do. We apply the method to empirical data from four different countries concerning social values.

13.
Present optimization techniques in latent class analysis apply the expectation maximization algorithm or the Newton-Raphson algorithm for optimizing the parameter values of a prespecified model. These techniques can be used to find maximum likelihood estimates of the parameters, given the specified structure of the model, which is defined by the number of classes and, possibly, fixation and equality constraints. The model structure is usually chosen on theoretical grounds. A large variety of structurally different latent class models can be compared using goodness-of-fit indices of the chi-square family, Akaike’s information criterion, the Bayesian information criterion, and various other statistics. However, finding the optimal structure for a given goodness-of-fit index often requires a lengthy search in which all kinds of model structures are tested. Moreover, solutions may depend on the choice of initial values for the parameters. This article presents a new method by which one can simultaneously infer the model structure from the data and optimize the parameter values. The method consists of a genetic algorithm in which any goodness-of-fit index can be used as a fitness criterion. In a number of test cases in which data sets from the literature were used, it is shown that this method provides models that fit equally well as or better than the models suggested in the original articles.
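As a small illustration of how two of the indices named in this entry trade fit against complexity, the following Python sketch uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, together with the parameter count of an unrestricted latent class model. The function names and the log-likelihood values are hypothetical, chosen only for the example; the sketch does not reproduce the article's genetic algorithm.

```python
from math import log

def aic(log_likelihood, n_params):
    """Akaike's information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * log(n_obs) - 2 * log_likelihood

def lca_free_params(n_classes, categories_per_item):
    """Free parameters of an unrestricted latent class model:
    (classes - 1) class proportions plus, per class, (categories - 1)
    response probabilities for every item."""
    return (n_classes - 1) + n_classes * sum(c - 1 for c in categories_per_item)

# Hypothetical comparison: five binary items, 500 respondents,
# made-up maximized log-likelihoods for a 2-class and a 3-class model.
items = [2, 2, 2, 2, 2]
for classes, loglik in [(2, -1510.0), (3, -1502.0)]:
    k = lca_free_params(classes, items)
    print(classes, k, aic(loglik, k), round(bic(loglik, k, 500), 2))
```

With these made-up numbers the 3-class model improves the log-likelihood but pays for six extra parameters, so which model a search would prefer depends on the index used as the fitness criterion.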

14.
A mathematical model for the analysis of category clustering is developed and tested. The model, which can be applied to categories of any size, is an extension of a two-item statistical model developed by Batchelder and Riefer (Psychological Review, 1980, 87, 375–397), and is equivalent to their model when categories consist of two items. The model is based on a current theory of clustering which postulates that the learning of a list of category items occurs on different hierarchical levels. Two category list-learning experiments are presented, and the data from these experiments are analyzed using the general statistical model. The first experiment reveals that the probabilities of storing and retrieving a cluster increase with category size, while the learning of items as singletons decreases. The effects of within-category spacing indicate that the storage of clusters decreases while cluster retrievability increases with an increase in input spacing. In the second experiment, the storage and retrieval of clusters are shown to be unaffected by whether the presentation of items is uncued or cued with the name of the category. However, the association of items decreases and the learning of items as singletons increases with uncued presentation. In the final sections, the general statistical model is compared to other methods for the measurement of category clustering. The model is shown to be superior to numerical indices of clustering, since these measures are not based on any theory of clustering, and because unitary measures cannot capture the multiprocess nature of categorized recall. The model is also argued to have certain advantages over other mathematical models that have been applied to category clustering, since these models cannot account for situations in which a portion of the items are clustered while others are learned singularly.

15.
Clusterwise linear regression is a multivariate statistical procedure that attempts to cluster objects with the objective of minimizing the sum of the error sums of squares for the within-cluster regression models. In this article, we show that the minimization of this criterion makes no effort to distinguish the error explained by the within-cluster regression models from the error explained by the clustering process. In some cases, most of the variation in the response variable is explained by clustering the objects, with little additional benefit provided by the within-cluster regression models. Accordingly, there is tremendous potential for overfitting with clusterwise regression, which is demonstrated with numerical examples and simulation experiments. To guard against the misuse of clusterwise regression, we recommend a benchmarking procedure that compares the results for the observed empirical data with those obtained across a set of random permutations of the response measures. We also demonstrate the potential for overfitting via an empirical application related to the prediction of reflective judgment using high school and college performance measures.
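The overfitting concern in this entry can be seen in a toy calculation: if objects are simply split into two clusters at the median of the response, a large share of the response variation disappears before any predictor is used. The Python sketch below is only that illustration, under the assumption of a two-cluster median split; the function names are hypothetical, and it is not the permutation benchmarking procedure the authors recommend.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def share_absorbed_by_split(response):
    """Fraction of total response variation removed by a two-cluster
    split at the median of the response, with no predictors at all."""
    ordered = sorted(response)
    half = len(ordered) // 2
    within = sum_of_squares(ordered[:half]) + sum_of_squares(ordered[half:])
    return 1 - within / sum_of_squares(ordered)

# Hypothetical response values, evenly spread.
response = [0, 1, 2, 3, 4, 5, 6, 7]
print(share_absorbed_by_split(response))
```

For evenly spread values the split alone removes roughly three quarters of the variation, which is why an apparently excellent clusterwise fit needs a benchmark before the regression models are credited with it.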

16.
Answer similarity indices were developed to detect pairs of test takers who may have worked together on an exam or instances in which one test taker copied from another. For any pair of test takers, an answer similarity index can be used to estimate the probability that the pair would exhibit the observed response similarity or a greater degree of similarity under the assumption that the test takers worked independently. To identify groups of test takers with unusually similar response patterns, Wollack and Maynes suggested conducting cluster analysis using probabilities obtained from an answer similarity index as measures of distance. However, interpretation of results at the cluster level can be challenging because the method is sensitive to the choice of clustering procedure and only enables probabilistic statements about pairwise relationships. This article addresses these challenges by presenting a statistical test that can be applied to clusters of examinees rather than pairs. The method is illustrated with both simulated and real data.

17.
This article is concerned with procedures for determining the number of clusters in a data set. Most of the procedures or stopping rules currently in use involve finding internally coherent and externally isolated clusters, but do not derive from the formal structure of the respective clustering model. Based on the graph theoretic concepts of minimal spanning tree, maximal spanning tree, and homomorphic function, a new criterion is advanced that yields a well-defined clustering solution. Its performance in determining the number of clusters in several empirical data sets is evaluated by comparing it to four prominent stopping rules. It is shown that the proposed criterion not only possesses mathematically attractive properties but also may contribute to solving the number-of-clusters problem.
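For readers unfamiliar with the minimal spanning tree mentioned in this entry, the Python sketch below builds one with Prim's algorithm and then removes the k - 1 longest edges, which yields the same k groups as single-linkage clustering. This is a well-known connection offered as background only; it is not the criterion proposed in the article, and the function names and data are hypothetical.

```python
from math import dist

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.
    Returns a list of (weight, i, j) edges."""
    # best[j] = (distance from j to its nearest tree vertex, that vertex)
    best = {j: (dist(points[0], points[j]), 0) for j in range(1, len(points))}
    edges = []
    while best:
        j = min(best, key=lambda v: best[v][0])  # closest vertex outside the tree
        weight, i = best.pop(j)
        edges.append((weight, i, j))
        for v in best:                           # relax distances via the new vertex
            d = dist(points[j], points[v])
            if d < best[v][0]:
                best[v] = (d, j)
    return edges

def clusters_from_mst(n_points, edges, k):
    """Drop the k - 1 longest tree edges and return the connected groups."""
    parent = list(range(n_points))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for _, i, j in sorted(edges)[: n_points - k]:  # keep the shortest n - k edges
        parent[find(i)] = find(j)
    groups = {}
    for v in range(n_points):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

# Hypothetical one-dimensional data with an obvious gap.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
tree = minimum_spanning_tree(points)
print(tree)
print(clusters_from_mst(len(points), tree, 2))
```

The long edge spanning the gap is the one removed when k = 2, so the lengths of the tree's edges carry information about how many well-separated groups the data contain.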

18.
A general procedure that can be applied to any type of free recall data for the quantitative measurement of subjective organization intertrial repetitions (ITRs) is described. The procedure makes it possible to examine any unit size with any type of internal consistency criterion. The application of this procedure and the corresponding expected value formula to four distinct data measurement cases is described, and several applied subjective organization measures are discussed. In addition, the relationship of clustering as a special case of subjective organization is demonstrated.

19.
Minimization of the within-cluster sums of squares (WCSS) is one of the most important optimization criteria in cluster analysis. Although cluster analysis modules in commercial software packages typically use heuristic methods for this criterion, optimal approaches can be computationally feasible for problems of modest size. This paper presents a new branch-and-bound algorithm for minimizing WCSS. Algorithmic enhancements include an effective reordering of objects and a repetitive solution approach that precludes the need for splitting the data set, while maintaining strong bounds throughout the solution process. The new algorithm provided optimal solutions for problems with up to 240 objects and eight well-separated clusters. Poorly separated problems with no inherent cluster structure were optimally solved for up to 60 objects and six clusters. The repetitive branch-and-bound algorithm was also successfully applied to three empirical data sets from the classification literature.

20.
Much of the research on psychopathy has treated it as a unitary construct operationalized by total scores on one (or more) measures. More recent studies on the Psychopathic Personality Inventory (PPI) suggest the existence of two distinct facets of psychopathy with unique external correlates. Here, the authors report reanalyses of two offender data sets that included scores on the PPI along with various theoretically relevant criterion variables. Consistent with hypotheses, the two PPI factors showed convergent and discriminant relations with criterion measures, many of which would otherwise have been obscured when relying on PPI total scores. These results highlight the importance of examining facets of psychopathy as well as total scores.
