Found 20 similar articles (search time: 46 ms)
The kappa coefficient is one of the most widely used measures for evaluating the agreement between two raters asked to assign N objects to one of K nominal categories. Weighted versions of kappa enable partial credit to be awarded for near agreement, most notably in the case of ordinal categories. An exact significance test for weighted kappa can be conducted by enumerating all rater agreement tables with the same fixed marginal frequencies as the observed table, and accumulating the probabilities for all tables that produce a weighted kappa index that is greater than or equal to the observed measure. Unfortunately, complete enumeration of all tables is computationally unwieldy for modest values of N and K. We present an implicit enumeration algorithm for conducting an exact test of weighted kappa, which can be applied to tables of non‐trivial size. The algorithm is particularly efficient for ‘good’ to ‘excellent’ values of weighted kappa that typically have very small p‐values. Therefore, our method is beneficial for situations where resampling tests are of limited value because the number of trials needed to estimate the p‐value tends to be large.
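As a rough illustration of the complete-enumeration test described in this abstract (not the implicit enumeration algorithm the authors present), the Python sketch below computes weighted kappa and a brute-force exact p-value. The function names, the disagreement weights |i - j|^power, and the integer category coding 0..k-1 are assumptions made for illustration. Permuting one rater's labels keeps both sets of marginal frequencies fixed, so the share of permutations whose kappa is at least the observed value equals the exact p-value; this is only feasible for very small N.

```python
from itertools import permutations


def weighted_kappa(r1, r2, k, power=2):
    """Weighted kappa for two raters; categories are coded 0..k-1 (assumed).

    power=1 gives linear weights, power=2 quadratic weights.
    Assumes each rater uses at least two categories (otherwise exp is 0).
    """
    n = len(r1)
    # Observed weighted disagreement.
    obs = sum(abs(a - b) ** power for a, b in zip(r1, r2)) / n
    # Marginal proportions of each rater.
    row = [r1.count(c) / n for c in range(k)]
    col = [r2.count(c) / n for c in range(k)]
    # Weighted disagreement expected under independence.
    exp = sum(abs(i - j) ** power * row[i] * col[j]
              for i in range(k) for j in range(k))
    return 1.0 - obs / exp


def exact_p_value(r1, r2, k, power=2):
    """Brute-force exact test: enumerate all N! relabelings of rater 2."""
    observed = weighted_kappa(r1, r2, k, power)
    hits = total = 0
    for perm in permutations(r2):
        total += 1
        # Small tolerance guards against floating-point ties.
        if weighted_kappa(r1, perm, k, power) >= observed - 1e-12:
            hits += 1
    return observed, hits / total


if __name__ == "__main__":
    rater1 = [0, 0, 1, 1, 2, 2]
    rater2 = [0, 0, 1, 2, 2, 2]
    print(exact_p_value(rater1, rater2, k=3))
```

The factorial growth of this loop is exactly why complete enumeration becomes impractical for realistic N, which motivates the more efficient algorithm the abstract describes.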

Some Paradoxical Results for the Quadratically Weighted Kappa   Cited by: 1 (self-citations: 0; citations by others: 1)
The quadratically weighted kappa is the most commonly used weighted kappa statistic for summarizing interrater agreement on an ordinal scale. The paper presents several properties of the quadratically weighted kappa that are paradoxical. For agreement tables with an odd number of categories n it is shown that if one of the raters uses the same base rates for categories 1 and n, categories 2 and n−1, and so on, then the value of quadratically weighted kappa does not depend on the value of the center cell of the agreement table. Since the center cell reflects the exact agreement of the two raters on the middle category, this result questions the applicability of the quadratically weighted kappa to agreement studies. If one wants to report a single index of agreement for an ordinal scale, it is recommended that the linearly weighted kappa instead of the quadratically weighted kappa is used.
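The center-cell property can be checked numerically. The sketch below is illustrative only: the 3×3 table is made up, the function names are hypothetical, and the weights |i - j|^power are an assumed form. The row rater has equal totals for categories 1 and 3, and the center cell is varied while all other cells stay fixed.

```python
def weighted_kappa_from_table(table, power):
    """Weighted kappa from a square agreement table of counts.

    power=1 gives linear weights, power=2 quadratic weights.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(table[i][j] for i in range(k)) for j in range(k)]
    obs = sum(abs(i - j) ** power * table[i][j]
              for i in range(k) for j in range(k)) / n
    exp = sum(abs(i - j) ** power * rows[i] * cols[j]
              for i in range(k) for j in range(k)) / n ** 2
    return 1.0 - obs / exp


def table_with_center(x):
    """Hypothetical table: row totals are 15, 10 + x, 15 (symmetric)."""
    return [[10, 3, 2],
            [4, x, 6],
            [1, 5, 9]]


if __name__ == "__main__":
    for x in (0, 5, 50, 500):
        t = table_with_center(x)
        print(x,
              round(weighted_kappa_from_table(t, 2), 6),   # quadratic
              round(weighted_kappa_from_table(t, 1), 6))   # linear
```

In this example the quadratic value stays constant as the center cell changes, whereas the linear value moves with it, which is consistent with the recommendation in the abstract.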

This paper demonstrates and compares methods for estimating the interrater reliability and interrater agreement of performance ratings. These methods can be used by applied researchers to investigate the quality of ratings gathered, for example, as criteria for a validity study, or as performance measures for selection or promotional purposes. While estimates of interrater reliability are frequently used for these purposes, indices of interrater agreement appear to be rarely reported for performance ratings. A recommended index of interrater agreement, the T index (Tinsley & Weiss, 1975), is compared to four methods of estimating interrater reliability (Pearson r, coefficient alpha, mean correlation between raters, and intraclass correlation). Subordinate and superior ratings of the performance of 100 managers were used in these analyses. The results indicated that, in general, interrater agreement and reliability among subordinates were fairly high. Interrater agreement between subordinates and superiors was moderately high; however, interrater reliability between these two rating sources was very low. The results demonstrate that interrater agreement and reliability are distinct indices and that both should be reported. Reasons are discussed as to why interrater reliability should not be reported alone. This paper is based, in part, on a thesis submitted to East Carolina University by the second author. Portions of this study were presented at the American Psychological Association meeting in New Orleans, LA, August, 1989. The authors would like to thank Michael Campion and two anonymous reviewers for their comments on earlier drafts of this paper.

The meaning and properties of a commonly used index of reliability, S/L, were examined critically. It was found that the index does not reflect any conventional concept of reliability. When used for an identical behavioral observation session, it is not statistically correlated with other reliability indices. Within an observation session, the standardizing measure of L is beyond the control of the investigator. Furthermore, the reason for the choice of L as the standard is unclear. The role of chance agreement in S/L is not known. The exact interpretation of the index depends on which observer reports L. Overall, the conceptual and mathematical meaning of S/L is dubious. It is suggested that the S/L index should not be used until its nature is shown to be a measure of reliability. Other approaches such as the intraclass correlations and generalizability coefficients should be used instead. The authors are indebted to Johnny Matson for his critique of an earlier version of this paper.

The percentage agreement index has been and continues to be a popular measure of interobserver reliability in applied behavior analysis and child development, as well as in other fields in which behavioral observation techniques are used. An algebraic method and a linear programming method were used to assess chance-corrected reliabilities for a sample of past observations in which the percentage agreement index was used. The results indicated that, had kappa been used instead of percentage agreement, between one-fourth and three-fourths of the reported observations could be judged as unreliable against a lenient criterion and between one-half and three-fourths could be judged as unreliable against a more stringent criterion. It is suggested that the continued use of the percentage agreement index has seriously undermined the reliabilities of past observations and can no longer be justified in future studies.
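To make the contrast between percentage agreement and chance-corrected agreement concrete, here is a small Python sketch. The 2×2 counts are invented for illustration and the function names are hypothetical; the kappa formula is the standard unweighted Cohen's kappa, not the algebraic or linear programming methods used in the study.

```python
def percentage_agreement(table):
    """Proportion of observations on the main diagonal of a square table."""
    n = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / n


def cohen_kappa(table):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = percentage_agreement(table)
    rows = [sum(row) / n for row in table]
    cols = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Agreement expected by chance from the two observers' base rates.
    p_e = sum(rows[i] * cols[i] for i in range(k))
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical interval-by-interval record: behavior absent vs. present.
    # Both observers score "absent" in most intervals.
    observations = [[90, 4],
                    [4, 2]]
    print(percentage_agreement(observations))
    print(cohen_kappa(observations))
```

With a behavior that is rarely scored, the two observers agree on most intervals simply because both mostly record "absent", so percentage agreement is high while kappa is far lower.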

Two predictions arising from previous theoretical and empirical work which demonstrated that spontaneous changes of bimanual coordination patterns result from a loss of pattern stability (i.e., a nonequilibrium phase transition) were tested: (a) that the time it takes to intentionally switch from one pattern to another depends on the differential stability of the patterns themselves; and (b) that an intention, defined as an intended behavioral pattern, can change the dynamical characteristics, e.g., the overall stability of the behavioral patterns. Subjects moved both index fingers rhythmically at one of six movement frequencies while performing either an in-phase or antiphase pattern of finger coordination. On cue from an auditory signal, subjects switched from the ongoing pattern to the other pattern. The relative phase of movement between the two fingers was used to characterize the ongoing coordinative pattern. The time taken to switch between patterns, or switching time, and relative phase fluctuations were used to evaluate the modified pattern dynamics resulting from a subject's intention to change patterns. Switching from the in-phase to the antiphase pattern was significantly slower than switching in the opposite direction for all subjects. Both the mean and distribution of switching times in each direction were found to be in agreement with model predictions. Movement frequency had little effect on switching time, a finding that is also consistent with the model. Relative phase fluctuations were significantly larger when moving in the antiphase pattern at the highest movement frequencies studied. The results show that, although intentional influences act to modify a coordinative pattern's intrinsic dynamics, the influence of these dynamics on the resulting behavior is always present and is particularly strong at high movement frequencies.

Does producing syntactic agreement rely on syntactic or memory-based retrieval processes? The present study investigated the extent to which syntactic processing deficits and working memory (WM) deficits predict susceptibility to agreement attraction [Bock, K., & Miller, C. A. (1991). Broken agreement. Cognitive Psychology, 23, 45–93], where speakers tend to erroneously produce plural agreement for a singular subject when another noun in the sentence is grammatically plural. Four brain-injured patients with varying degrees of grammatical and WM deficits completed sentences with local nouns that matched or mismatched in number with the head noun, and that were plausible or implausible subjects. Both aspects of grammatical deficits and the extent of WM deficits predicted the extent of agreement attraction effects. These data are consistent with the proposal that producing an agreeing verb involves a cue-based search in WM for an appropriate controlling noun, which is subject to interference from other elements in memory with similar properties [cf. Badecker, W., & Kuminiak, F. (2007). Morphology, agreement and working memory retrieval in sentence production: Evidence from gender and case in Slovak. Journal of Memory and Language, 56(1), 65–85. doi:10.1016/j.jml.2006.08.004].

Two studies were conducted to determine if lay judges could accurately assess another individual’s integrity level when using overt and covert integrity inventories. In Study 1, participants took part in simulated employment interviews and then both the participants and lay interviewers completed an overt integrity test comparable to the Reid Report integrity survey [Reid London House. (2004). Abbreviated Reid Report. Minneapolis, MN: NCS Pearson]. Self-lay judge agreement and peer-lay judge agreement were used as the criteria for accuracy. In Study 2, participants took part in either a simulated structured, unstructured or informal employment interview format and completed both overt and covert integrity inventories. The results suggest that lay judges (as well as acquaintances) are fairly accurate in assessing others’ integrity levels based upon a very brief 10-min interaction with an individual, when using either an overt or covert integrity inventory. The findings also suggest that the informal interview format can significantly enhance the accuracy of a lay-judge’s assessment of the participant’s integrity level when a covert measure of integrity is used.

Brunswik's lens model has been widely used in the modeling and analysis of judgment tasks and the lens model equation has been an important part of most analyses. In such analyses, researchers often have based arguments and interpretations upon the magnitude of various components of the lens model equation. One component is G, an index of agreement between a linear model of the subject's judgments and a linear model of the task environment. This paper addresses the index G and shows that while in principle −1 ≤ G ≤ 1, the distribution of G is often skewed toward high values so that caution must be used in the interpretation of its magnitude. Moreover, the results reported here apply to the relation between linear models in general.
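For orientation, the lens model equation in its commonly cited form is reproduced below. Only G is named in the abstract; the other symbols are stated here as assumptions following the usual convention: r_a is the correlation between judgments and the criterion (achievement), R_e and R_s are the multiple correlations of the linear models of the environment and of the subject, and C is the correlation between the residuals of the two models.

```latex
% Lens model equation (commonly cited form; symbols other than G are assumed)
\[
  r_a \;=\; G \, R_e \, R_s \;+\; C \,\sqrt{1 - R_e^{2}}\,\sqrt{1 - R_s^{2}},
  \qquad -1 \le G \le 1 .
\]
```

Because G is itself a correlation between the predictions of the two linear models, it is bounded by −1 and 1, which is the range referred to in the abstract.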

To test the agreement between two observers who categorize a number of objects when the categories have not been specified in advance, Brennan & Light (1974) developed a statistic A′ and suggested a normal approximation for its distribution. In this paper it is shown that this approximation is inadequate particularly when one, or both, of the observers place a fairly equal number of objects in all of their categories. A chi-squared approximation to the distribution of A′ is developed and is shown to work well in a variety of situations. The relative powers of A′ and the ordinary X² test for association are dependent on the type of ‘agreement between the observers’ that is assumed. However a simulation for a fairly general type of agreement indicates that the X² test is more powerful. As the X² test is also much easier to apply, it would seem preferable in most situations.

Answer similarity indices were developed to detect pairs of test takers who may have worked together on an exam or instances in which one test taker copied from another. For any pair of test takers, an answer similarity index can be used to estimate the probability that the pair would exhibit the observed response similarity or a greater degree of similarity under the assumption that the test takers worked independently. To identify groups of test takers with unusually similar response patterns, Wollack and Maynes suggested conducting cluster analysis using probabilities obtained from an answer similarity index as measures of distance. However, interpretation of results at the cluster level can be challenging because the method is sensitive to the choice of clustering procedure and only enables probabilistic statements about pairwise relationships. This article addresses these challenges by presenting a statistical test that can be applied to clusters of examinees rather than pairs. The method is illustrated with both simulated and real data.

Agreements and disagreements between expert statements influence lay people's beliefs. But few studies have examined what is perceived as a disagreement. We report six experiments where people rated agreement between pairs of probabilistic statements about environmental events, attributed to two different experts or to the same expert at two different points in time. The statements differed in frame, by focusing on complementary outcomes (45% probability that smog will have negative health effects vs. 55% probability that it will not have such effects), in probability level (45% vs. 55% probability of negative effects), or in both respects. Opposite frames strengthened disagreement when combined with different probability levels. Approximate probabilities can be “framed” in yet another way by indicating reference values they are “over” or “under”. Statements that use different directional verbal terms (over vs. under 50%) indicated greater disagreement than statements with the same directional term but different probability levels (over 50% vs. over 70%). Framing and directional terms similarly affected consistency judgments when both statements were issued by the same expert on different occasions. The effect of framing on perceived agreement was significant for medium (10 and 20 percentage points) differences between probabilities, whereas the effect of directional term was stable for numerical differences up to 40 percentage points. To emphasize agreement between different estimates, they should be framed in the same way. To accentuate disagreements or changes of opinion, opposite framings should be used.

We propose a model to measure risk in a prisoner's dilemma based on Coombs' (1973) re‐parameterization of the game as an individual risk decision‐making task that chooses between a gamble of cooperation and another gamble of defection. Specifically, we propose an index, r, to represent the risk associated with cooperation relative to defection. In conjunction with Rapoport's (1967) index of cooperation (K), our formulation of risk allows us to construct games that vary in risk (as indexed by r) while controlling for cooperativeness (as indexed by K). Following utility analysis that models risk seeking as a convex utility function and risk aversion as a concave function, we predict that risk‐seeking people cooperate more in games in which the cooperation choice is more risky, whereas risk‐averse people cooperate more in games in which the cooperation choice is less risky. In three studies in which we varied game parameters, used different measures of risk orientation and prosocial orientation and used different experimental procedures, we found robust results supporting our predictions. Theoretical analysis of our formulation further suggests that the risk and cooperativeness of a prisoner's dilemma game are not entirely independent. Games that have a higher cooperativeness index are necessarily more risky. Copyright © 2011 John Wiley & Sons, Ltd.
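As a small aid to reading this abstract, the sketch below computes Rapoport's index of cooperation, K = (R − P) / (T − S), for a prisoner's dilemma with payoffs T (temptation), R (reward), P (punishment), and S (sucker's payoff). The payoff values and the function name are hypothetical. The risk index r proposed in the paper is not defined in the abstract and is therefore not reproduced here.

```python
def cooperation_index(t, r, p, s):
    """Rapoport's index of cooperation K = (R - P) / (T - S).

    Requires the prisoner's dilemma payoff ordering T > R > P > S.
    """
    if not (t > r > p > s):
        raise ValueError("payoffs must satisfy T > R > P > S")
    return (r - p) / (t - s)


if __name__ == "__main__":
    # Two hypothetical games with different cooperativeness.
    print(cooperation_index(t=5, r=3, p=1, s=0))
    print(cooperation_index(t=5, r=4, p=1, s=0))
```

Holding K fixed while changing the individual payoffs is what allows games to differ in risk at the same level of cooperativeness, as the abstract describes.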

Mothers' and fathers' responses on the Personality Inventory for Children (PIC) were compared for 360 clinic-referred children and adolescents. Interparent agreement was measured by several different indices. Correlations between parental pairs of scale T-scores derived from each parent averaged .66 for 13 of the profile scales; 9 of the scales exceeded this value. In contrast, mothers and fathers agreed in the classification of the presence of clinical significance an average of 77% of the time across these 13 scales, and pairs of parental PIC profiles contained an average of 3 scales in disagreement. The type of index used to measure interparent agreement was found to affect the results. The discussion examines the nature of interparent disagreement and addresses the implications regarding the use of fathers as informants on this instrument. This article is based on Thomas A. Hulbert's master's thesis. The authors extend their appreciation to James Orisell and Dena Mussaff for their assistance in data preparation and analysis.

The Hand Test is a projective technique yielding an Acting Out Score (AOS) which the test authors feel is useful as a predictor of aggressive, acting-out behavior. This study produced data regarding the stability of AOS, the relation of AOS to another projective device used to assess aggressive potential and the ability of AOS to predict teacher ratings of acting-out behavior among emotionally disturbed pre-adolescents. Results indicate that for this sample of Ss the AOS lacks stability as a measurement construct, does not correlate with another projective measure of aggressive potential and is not a useful predictor of acting-out behavior as rated by teachers of emotionally disturbed pre-adolescents.

Dimensions underlying the definition of items as feminine and masculine were examined in a set of three studies. Items chosen by children as belonging to males or females were used as the initial stimuli. These included traditionally stereotyped items such as a hammer and an iron, as well as more metaphorically related items such as bears and flowers. The raters in all three studies were undergraduates (70% White, 30% minorities). In Study 1, the items were rated using a set of 40 common adjectives. Three factors resulted: two related to masculine items and one to feminine items. In Study 2, a subset of the adjectives was used to rate abstract paintings that had been designated feminine or masculine by another group of adults. In Study 3, a set of stimuli was developed using the adjectives from the previous two studies. The items were rated as feminine or masculine and matched the initial coding of the adjective. The new items were also rated on the same adjectives by another set of adults. Again, the masculine adjectives were assigned to masculine items and feminine to feminine items. There was excellent agreement across three different sets of stimuli on the underlying dimensions of gender definition, even using items that were not traditionally stereotyped.



A contributing reason for the common problem of missing middle anchors on behaviorally anchored rating scales (BARS) is the standard deviation (SD) criterion used in the scaling phase. An alternative BARS scaling process is proposed based on the a_wg(1) index of interrater agreement.

When determining whether a particular transition is more characteristic of one group than of another, two things are required: an index associated with the transition of interest and a statistical test that can determine whether group membership systematically affects values for that index. Here the familiar parametric t test is compared with a test based on sampled permutations. Indices considered are the odds and log odds ratio, Yule’s Q, Wampold’s (1989) transformed kappa, and phi. The odds and log odds ratio are monotonically increasing functions of Yule’s Q and so give similar results. Yule’s Q and phi are essentially rank order invariant and usually give similar results. Transformed kappa, however, rank orders subjects somewhat differently than the others; moreover, it appears somewhat biased. With respect to the tests, when subjects are 20 or more it does not matter much whether sampled permutation or parametric t tests are used; both yield essentially the same result. However, when subjects are fewer than 20, or whenever there is any other reason to think that parametric assumptions may not be met, permutation tests are recommended. A computer program that effects such tests is described.
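The indices and the permutation approach named in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the computer program the authors describe: the 2×2 cell labels a, b, c, d, the function names, the difference-in-means test statistic, and the sample data are all hypothetical. Transformed kappa is omitted because its definition is not given in the abstract.

```python
import math
import random
from statistics import mean


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table [[a, b], [c, d]] (b and c must be > 0)."""
    return (a * d) / (b * c)


def yules_q(a, b, c, d):
    """Yule's Q = (ad - bc) / (ad + bc), equal to (OR - 1) / (OR + 1)."""
    return (a * d - b * c) / (a * d + b * c)


def phi(a, b, c, d):
    """Phi coefficient for a 2x2 table."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom


def sampled_permutation_test(group_a, group_b, n_samples=2000, seed=0):
    """Two-sided sampled permutation test on the difference in group means.

    Each value is one subject's index (for example, Yule's Q).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_samples):
        rng.shuffle(pooled)  # random reassignment of subjects to groups
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_samples + 1)


if __name__ == "__main__":
    print(odds_ratio(10, 5, 4, 12), yules_q(10, 5, 4, 12), phi(10, 5, 4, 12))
    q_group_a = [0.80, 0.70, 0.90, 0.75, 0.85, 0.80]
    q_group_b = [0.10, 0.00, 0.20, -0.10, 0.15, 0.05]
    print(sampled_permutation_test(q_group_a, q_group_b))
```

Because Q is a monotone function of the odds ratio, the two indices rank subjects identically, which is why the abstract reports similar results for them.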

A survey of residual analysis in behavior‐analytic research reveals that existing methods are problematic in one way or another. A new test for residual trends is proposed that avoids the problematic features of the existing methods. It entails fitting cubic polynomials to sets of residuals and comparing their effect sizes to those that would be expected if the sets of residuals were random. To this end, sampling distributions of effect sizes for fits of a cubic polynomial to random data were obtained by generating sets of random standardized residuals of various sizes, n. A cubic polynomial was then fitted to each set of residuals and its effect size was calculated. This yielded a sampling distribution of effect sizes for each n. To test for a residual trend in experimental data, the median effect size of cubic‐polynomial fits to sets of experimental residuals can be compared to the median of the corresponding sampling distribution of effect sizes for random residuals using a sign test. An example from the literature, which entailed comparing mathematical and computational models of continuous choice, is used to illustrate the utility of the test.
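The procedure outlined in this abstract can be sketched in Python as below. Several details are assumptions rather than the authors' specification: R² is used as the effect size, random residuals are drawn from a standard normal distribution, the residuals are taken to be equally spaced, and all function names are hypothetical.

```python
import random
from math import comb
from statistics import median


def solve_linear(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def cubic_r2(residuals):
    """R^2 of a least-squares cubic fitted to equally spaced residuals."""
    n = len(residuals)
    xs = [2.0 * i / (n - 1) - 1.0 for i in range(n)]  # rescaled to [-1, 1]
    design = [[1.0, x, x * x, x ** 3] for x in xs]
    ata = [[sum(row[p] * row[q] for row in design) for q in range(4)]
           for p in range(4)]
    aty = [sum(row[p] * y for row, y in zip(design, residuals))
           for p in range(4)]
    coef = solve_linear(ata, aty)
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in design]
    mean_y = sum(residuals) / n
    ss_tot = sum((y - mean_y) ** 2 for y in residuals)
    ss_res = sum((y - f) ** 2 for y, f in zip(residuals, fitted))
    return 1.0 - ss_res / ss_tot


def null_median_r2(n, n_sets=2000, seed=0):
    """Median R^2 of cubic fits to sets of n random residuals."""
    rng = random.Random(seed)
    return median(cubic_r2([rng.gauss(0, 1) for _ in range(n)])
                  for _ in range(n_sets))


def sign_test_p(n_above, n_total):
    """Two-sided sign test p-value for n_above of n_total above the median."""
    tail = max(n_above, n_total - n_above)
    p = 2 * sum(comb(n_total, i) for i in range(tail, n_total + 1))
    return min(1.0, p / 2 ** n_total)
```

A typical use would compute `cubic_r2` for each experimental residual set of size n, count how many exceed `null_median_r2(n)`, and pass that count to `sign_test_p`.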

Item-analysis data are usually obtained from a single test administration, with a given item sequence and time limit. Questions can be raised as to the effects upon item data resulting from changes in item-position and test-timing. In this study, two forms of a verbal test and two forms of a mathematics test were used. In each case, both forms of each test contained the same items, but items coming early in one form were placed late in the other. Each of these forms was administered once with a short time limit and once with generous timing to comparable groups of high school students. The relationships of various speed and power scores were determined, and the changes which occurred during the added time were studied. Values of the item indices p (proportion right), (another difficulty index), and the item-test biserial correlation coefficient were obtained for both the speed and the power conditions and were systematically compared. The proportion right of those attempting the item, the index, and the biserial r were all found to have undesirable characteristics for items appearing late in a speeded test. The author gratefully acknowledges the suggestions and criticisms of Dr. Harold Gulliksen, Research Adviser at the Educational Testing Service.

