Similar literature
 20 similar articles found.
1.
The kappa coefficient is one of the most widely used measures for evaluating the agreement between two raters asked to assign N objects to one of K nominal categories. Weighted versions of kappa enable partial credit to be awarded for near agreement, most notably in the case of ordinal categories. An exact significance test for weighted kappa can be conducted by enumerating all rater agreement tables with the same fixed marginal frequencies as the observed table, and accumulating the probabilities for all tables that produce a weighted kappa index that is greater than or equal to the observed measure. Unfortunately, complete enumeration of all tables is computationally unwieldy for modest values of N and K. We present an implicit enumeration algorithm for conducting an exact test of weighted kappa, which can be applied to tables of non‐trivial size. The algorithm is particularly efficient for ‘good’ to ‘excellent’ values of weighted kappa that typically have very small p‐values. Therefore, our method is beneficial for situations where resampling tests are of limited value because the number of trials needed to estimate the p‐value tends to be large.
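
For contrast with the exact test described in this abstract, the resampling alternative it mentions can be sketched as follows. This is a minimal Monte Carlo permutation test for two raters, not the authors' implicit enumeration algorithm; the function names, the 0-based category coding, and the choice of linear disagreement weights are assumptions made for illustration.

```python
import numpy as np

def weighted_kappa(a, b, k, power=1):
    """Weighted kappa for two raters; categories coded 0..k-1 (assumed coding).

    power=1 gives linear weights, power=2 quadratic weights.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    table = np.zeros((k, k))
    np.add.at(table, (a, b), 1)            # cross-classify the two raters
    p = table / table.sum()
    idx = np.arange(k)
    # Disagreement weights: 0 on the diagonal, 1 for the most distant cells.
    d = (np.abs(idx[:, None] - idx[None, :]) / (k - 1)) ** power
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))
    return 1.0 - (d * p).sum() / (d * expected).sum()

def permutation_p_value(a, b, k, n_perm=2000, seed=0):
    """Approximate P(kappa >= observed) with both sets of margins held fixed.

    Shuffling one rater's labels keeps the marginal frequencies unchanged,
    which mirrors the fixed-margin reference set of the exact test.
    """
    rng = np.random.default_rng(seed)
    b = np.asarray(b)
    observed = weighted_kappa(a, b, k)
    hits = 0
    for _ in range(n_perm):
        if weighted_kappa(a, rng.permutation(b), k) >= observed:
            hits += 1
    # Add-one correction: the estimate can never fall below 1 / (n_perm + 1).
    return observed, (hits + 1) / (n_perm + 1)
```

Because the estimate cannot go below 1 / (n_perm + 1), resolving the very small p-values that accompany high agreement requires a very large number of permutations, which is the limitation of resampling tests that the abstract points to.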

2.
Pingke Li. Psychometrika, 2016, 81(3): 795-801
The linearly and quadratically weighted kappa coefficients are popular statistics in measuring inter-rater agreement on an ordinal scale. It has been recently demonstrated that the linearly weighted kappa is a weighted average of the kappa coefficients of the embedded 2 by 2 agreement matrices, while the quadratically weighted kappa is insensitive to the agreement matrices that are row or column reflection symmetric. A rank-one matrix decomposition approach to the weighting schemes is presented in this note such that these phenomena can be demonstrated in a concise manner.

3.
There is a frequent need to measure the degree of agreement among R observers who independently classify n subjects within K nominal or ordinal categories. The most popular methods are usually kappa-type measurements. When R = 2, Cohen's kappa coefficient (weighted or not) is well known. When defined in the ordinal case while assuming quadratic weights, Cohen's kappa has the advantage of coinciding with the intraclass and concordance correlation coefficients. When R > 2, there are more discrepancies because the definition of the kappa coefficient depends on how the phrase ‘an agreement has occurred’ is interpreted. In this paper, Hubert's interpretation, that ‘an agreement occurs if and only if all raters agree on the categorization of an object’, is used, which leads to Hubert's (nominal) and Schuster and Smith's (ordinal) kappa coefficients. Formulae for the large-sample variances for the estimators of all these coefficients are given, allowing the latter to illustrate the different ways of carrying out inference and, with the use of simulation, to select the optimal procedure. In addition, it is shown that Schuster and Smith's kappa coefficient coincides with the intraclass and concordance correlation coefficients if the first coefficient is also defined assuming quadratic weights.

4.
Cohen’s Linearly Weighted Kappa is a Weighted Average of 2×2 Kappas (cited by 1; self-citations: 0; citations by others: 1)
An agreement table with n ≥ 3 ordered categories (n ∈ ℕ) can be collapsed into n−1 distinct 2×2 tables by combining adjacent categories. Vanbelle and Albert (Stat. Methodol. 6:157–163, 2009c) showed that the components of Cohen’s weighted kappa with linear weights can be obtained from these n−1 collapsed 2×2 tables. In this paper we consider several consequences of this result. One is that the weighted kappa with linear weights can be interpreted as a weighted arithmetic mean of the kappas corresponding to the 2×2 tables, where the weights are the denominators of the 2×2 kappas. In addition, it is shown that similar results and interpretations hold for linearly weighted kappas for multiple raters.
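
The identity stated in this abstract can be checked numerically. The sketch below computes the linearly weighted kappa of an n×n table directly, then recomputes it as the weighted mean of the n−1 collapsed 2×2 kappas, using each kappa's denominator (one minus chance agreement) as its weight. The function names and the example table are illustrative assumptions, not material from the paper.

```python
import numpy as np

def linear_weighted_kappa(table):
    """Cohen's weighted kappa with linear weights w_ij = 1 - |i - j| / (n - 1)."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    n = p.shape[0]
    idx = np.arange(n)
    w = 1.0 - np.abs(idx[:, None] - idx[None, :]) / (n - 1)
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))
    po = (w * p).sum()          # weighted observed agreement
    pe = (w * expected).sum()   # weighted chance agreement
    return (po - pe) / (1.0 - pe)

def collapsed_kappas(table):
    """Kappa and denominator for each 2x2 table formed by cutting after category c."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    n = p.shape[0]
    kappas, denominators = [], []
    for c in range(1, n):
        both_low = p[:c, :c].sum()     # both raters at or below the cut
        both_high = p[c:, c:].sum()    # both raters above the cut
        row_low = p[:c, :].sum()       # rater 1 marginal, low side
        col_low = p[:, :c].sum()       # rater 2 marginal, low side
        po = both_low + both_high
        pe = row_low * col_low + (1.0 - row_low) * (1.0 - col_low)
        kappas.append((po - pe) / (1.0 - pe))
        denominators.append(1.0 - pe)
    return np.array(kappas), np.array(denominators)

# Hypothetical 3x3 agreement table (rows: rater 1, columns: rater 2).
example = [[20, 5, 1],
           [4, 15, 6],
           [2, 3, 14]]
kappas, denominators = collapsed_kappas(example)
weighted_mean = np.average(kappas, weights=denominators)
```

The identity holds because the linear disagreement |i − j| counts the number of cut points that separate categories i and j, so the weighted disagreement is a sum of the 2×2 disagreements over all cuts.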

5.
When two (or more) observers are independently categorizing a set of observations, Cohen’s kappa has become the most notable measure of interobserver agreement. When the categories are ordinal, a weighted form of kappa becomes desirable. The two most popular weighting schemes are the quadratic weights and linear weights. Quadratic weights have been justified by the fact that the corresponding weighted kappa is asymptotically equivalent to an intraclass correlation coefficient. This paper deals with linear weights and shows that the corresponding weighted kappa is equivalent to the unweighted kappa when cumulative probabilities are substituted for probabilities. A numerical example is provided.

6.
This article describes a new measure of dispersion as an indication of consensus and dissention. Building on the generally accepted Shannon entropy, this measure utilizes a probability distribution and the ordered ranking of categories in an ordinal scale distribution to yield a value confined to the unit interval. Unlike other measures that need to be normalized, this measure is always in the interval 0 to 1. The measure is typically applied to the Likert scale to determine degrees of agreement among ordinal-ranked categories when one is dealing with data collection and analysis, although other scales are possible. Using this measure, investigators can easily determine the proximity of ordinal data to consensus (agreement) or dissention. Consensus and dissention are defined relative to the degree of proximity of values constituting a frequency distribution on the ordinal scale measure. The authors identify a set of criteria that a measure must satisfy in order to be an acceptable indicator of consensus and show how the consensus measure satisfies all the criteria.
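
The abstract does not state the formula. The sketch below assumes the form in which this consensus measure is commonly cited, Cns(X) = 1 + Σ p_i log2(1 − |X_i − μ| / d), where μ is the mean of the distribution and d is the width of the scale. Treat that formula, the function name, and the default 1..k category values as assumptions to be checked against the article.

```python
import math

def consensus(counts, values=None):
    """Consensus of an ordinal frequency distribution, in [0, 1].

    counts: frequency of each category, in scale order (at least 2 categories).
    values: numeric category values; defaults to 1..k (assumed Likert coding).
    Assumed formula: 1 + sum_i p_i * log2(1 - |x_i - mean| / width).
    """
    k = len(counts)
    if values is None:
        values = list(range(1, k + 1))
    total = sum(counts)
    probs = [c / total for c in counts]
    mean = sum(p * x for p, x in zip(probs, values))
    width = max(values) - min(values)
    # Skip empty categories so that 0 * log2(0) never arises.
    penalty = sum(
        p * math.log2(1 - abs(x - mean) / width)
        for p, x in zip(probs, values)
        if p > 0
    )
    return 1 + penalty
```

Under this formula, a distribution concentrated on a single category scores 1 (full consensus), and one split evenly between the two extreme categories scores 0.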

7.
The exact variance of weighted kappa with multiple raters (cited by 1; self-citations: 0; citations by others: 1)
Weighted kappa described by Cohen in 1968 is widely used in psychological research to measure agreement between two independent raters. Everitt then provided the exact variance for weighted kappa for two raters. In this paper, Everitt's exact variance is extended to three or more raters.

8.
Little attention has been paid to evaluating the use of DSM-III-R with preschool children. Children (N = 510) ages 2 to 5 years who were screened at the time of a pediatric visit were selected to participate in an evaluation which included questionnaires, a semistructured interview, developmental testing, and a play observation. Following the evaluation, two clinical child psychologists independently assigned DSM-III-R diagnoses. For each diagnostic category, kappa and Y coefficients were calculated; Y coefficients are less sensitive to base rates of disorders. For overall agreement, the weighted mean kappa (.61) and mean Y (.66) were moderately high. Overall agreement that the child had at least one of the disruptive disorders was substantial (kappa = .64; Y = .65); agreement that there was at least one of the emotional disorders was moderate for kappa (.54), but substantial for Y (.70). Kappa coefficients were higher for major categories of disorder than for specific disorders; however, Y coefficients did not show a decline for specific disorders. Interrater reliability of DSM-III-R appears to be similar for preschoolers and older children. This study was supported by grant MH46089 from the National Institute of Mental Health. A preliminary report was presented at the Fifth Annual NIMH International Research Conference on the Classification and Treatment of Mental Disorders in General Medical Settings, Bethesda, Maryland, September 1991. We gratefully acknowledge the members of the Pediatric Practice Research Group who participated in this study.

9.
Pi (π) and kappa (κ) statistics are widely used in the areas of psychiatry and psychological testing to compute the extent of agreement between raters on nominally scaled data. It is a fact that these coefficients occasionally yield unexpected results in situations known as the paradoxes of kappa. This paper explores the origin of these limitations, and introduces an alternative and more stable agreement coefficient referred to as the AC1 coefficient. Also proposed are new variance estimators for the multiple‐rater generalized π and AC1 statistics, whose validity does not depend upon the hypothesis of independence between raters. This is an improvement over existing alternative variances, which depend on the independence assumption. A Monte‐Carlo simulation study demonstrates the validity of these variance estimators for confidence interval construction, and confirms the value of AC1 as an improved alternative to existing inter‐rater reliability statistics.

10.
Resampling probability values for weighted kappa with multiple raters (cited by 1; self-citations: 0; citations by others: 1)
A new procedure to compute weighted kappa with multiple raters is described. A resampling procedure to compute approximate probability values for weighted kappa with multiple raters is presented. Applications of weighted kappa are illustrated with an example analysis of classifications by three independent raters.

11.
This study investigated the short-term stability of the 1991 Mirowsky-Ross 2 x 2 Index of the Sense of Control. From an ongoing longitudinal study, 304 subjects were randomly selected for test-retest interviews occurring 1 to 4 days after their regularly scheduled first follow-up interview. Test-retest reliability was assessed at the item level using percent agreement and weighted kappa. At the scale score level, reliability was assessed with the intraclass correlation coefficient (ICC). ICCs were also calculated within categories of demographic, socioeconomic, psychosocial, and functional status characteristics. There was moderate to substantial item-level agreement (mean weighted kappa = .51; weighted kappa range = .38 to .66). At the scale score level there was substantial agreement (ICC = .71). No appreciable differences in ICC values were found in the demographic, socioeconomic, psychosocial, and functional comparisons of status characteristics. Thus, this sense of control measure has acceptable test-retest reliability and is appropriate for use in longitudinal research.

12.
The purpose of this study was to investigate the effects of different types and magnitudes of serial dependence (first-order moving average and autoregression) and of linear regression lines within experimental phases on the agreement between results of visual and results of statistical data analyses. The stimulus material consisted of computer-simulated A-B-design data graphs. The time series were generated with a constant variance, varying degrees of treatment effects (changes in level), five conditions of serial dependency, and with or without linear regression lines. The material was presented to three groups of student raters (n1=52, n2=14, n3=17) who rated the treatment effect in the graphs on a five-point scale. These ratings were compared with statistical results (time-series analyses). Each group had to interpret 70 graphs, 35 of which had regression lines. Data were analyzed by means of two three-factor and one four-factor ANOVA and by graphic display. The linear regression lines generally enhanced the agreement between the raters' estimations and the statistical results. Serial dependency also increased the agreement between the two analysis methods. However, with strong autoregression processes in the data, the raters tended to overestimate treatment effects relative to time-series analysis. Parts of this study were presented at the World Congress on Behavior Therapy, Washington, DC, December 11, 1983. The authors wish to express their appreciation to Christoph Bonk and Willi Ecker for their extensive collaboration in data analysis and for their assistance in carrying out the study.

13.
The study tested the hypothesis that with respect to the big five domains associated with temperament, agreement between self- and others' ratings is higher than with respect to other domains. The same was expected with respect to peer–peer agreement. There were two groups of subjects: self-raters (n=639) and peer-raters (n=1278). All subjects completed the Polish Adjective List (PAL), which consists of five scales: Dynamism, Conscientiousness, Agreeableness, Excitability and Intellect, which are Polish representations of the big five personality factors extracted in American lexical studies. Each target person completed one self-rating inventory and was assessed by two peer-raters. Domains associated with temperament (Dynamism and Excitability) elicited higher agreement between self- and peer-ratings than Agreeableness and Intellect, although in the case of Conscientiousness judges appeared to be as accurate as in the case of Excitability. The pattern was even less clear with respect to the peer–peer comparison. The other finding shows that in the case of female raters there was more agreement between self- and peer-ratings than in the case of male raters.

14.
Agreement between Two Independent Groups of Raters (cited by 1; self-citations: 0; citations by others: 1)
We propose a coefficient of agreement to assess the degree of concordance between two independent groups of raters classifying items on a nominal scale. This coefficient, defined on a population-based model, extends the classical Cohen’s kappa coefficient for quantifying agreement between two raters. Weighted and intraclass versions of the coefficient are also given and their sampling variance is determined by the jackknife method. The method is illustrated on medical education data which motivated the research.

15.
New evidence-based physical activity (PA) guidelines and recommendations for constructing messages supplementing the guidelines have been put forth. As well, recent reviews have identified theoretical constructs that hold promise as targets for intervention: self-regulation, outcome expectancies and self-efficacy. The purpose of this study was to examine the integration of messages targeting self-regulation, self-efficacy and outcome expectancies in existing physical activity brochures. Twenty-two PA brochures from Canadian and American National Health Organizations were assessed for their use of self-efficacy, self-regulatory processes and outcome expectancies. Brochures were analyzed line-by-line using a modified version of the validated Content Analysis Approach to Theory-Specified Persuasive Educational Communication (CAATSPEC; Abraham, Southby, Quandte, Krahé, & van der Sluijs, 2007). Two independent raters coded a third of the brochures (n = 7). Inter-rater reliability was acceptable for 17 of the 20 categories (rs > .79). Discrepancies in all categories were discussed and agreement was reached. The remaining brochures were coded by one of the two raters. Usage of the three key theoretical constructs accounted for only 36.43% of brochure content (20.23% self-efficacy, 10.40% outcome expectancies, 5.80% self-regulation). Brochures lacked the use of a variety of theoretical strategies, specifically goal-setting, planning and verbal persuasion, and rarely highlighted the affective benefits of physical activity. In the future, brochures should aim to place increased emphasis on self-regulation, self-efficacy, and affective outcome expectancies.

16.
In many psychological studies, in particular those conducted by experience sampling, mental states are measured repeatedly for each participant. Such a design allows for regression models that separate between- from within-person, or trait-like from state-like, components of association between two variables. But these models are typically designed for continuous variables, whereas mental state variables are most often measured on an ordinal scale. In this paper we develop a model for disaggregating between- from within-person effects of one ordinal variable on another. As in standard ordinal regression, our model posits a continuous latent response whose value determines the observed response. We allow the latent response to depend nonlinearly on the trait and state variables, but impose a novel penalty that shrinks the fit towards a linear model on the latent scale. A simulation study shows that this penalization approach is effective at finding a middle ground between an overly restrictive linear model and an overfitted nonlinear model. The proposed method is illustrated with an application to data from the experience sampling study of Baumeister et al. (2020, Personality and Social Psychology Bulletin, 46, 1631).

17.
Clustering n objects into k groups under optimal scaling of variables (cited by 1; self-citations: 0; citations by others: 1)
We propose a method to reduce many categorical variables to one variable with k categories, or stated otherwise, to classify n objects into k groups. Objects are measured on a set of nominal, ordinal or numerical variables or any mix of these, and they are represented as n points in p-dimensional Euclidean space. Starting from homogeneity analysis, also called multiple correspondence analysis, the essential feature of our approach is that these object points are restricted to lie at only one of k locations. It follows that these k locations must be equal to the centroids of all objects belonging to the same group, which corresponds to a sum of squared distances clustering criterion. The problem is not only to estimate the group allocation, but also to obtain an optimal transformation of the data matrix. An alternating least squares algorithm and an example are given. The authors thank Eveline Kroezen and Teije Euverman for their comments on a previous draft of this paper.

18.
The rater agreement literature is complicated by the fact that it must accommodate at least two different properties of rating data: the number of raters (two versus more than two) and the rating scale level (nominal versus metric). While kappa statistics are most widely used for nominal scales, intraclass correlation coefficients have been preferred for metric scales. In this paper, we suggest a dispersion-weighted kappa framework for multiple raters that integrates some important agreement statistics by using familiar dispersion indices as weights for expressing disagreement. These weights are applied to ratings identifying cells in the traditional inter-judge contingency table. Novel agreement statistics can be obtained by applying less familiar indices of dispersion in the same way. This revised article was published online in August 2005 with the PDF paginated correctly.

19.
This study describes the development of a screening tool for gaming addiction in adolescents – the Gaming Addiction Identification Test (GAIT). Its development was based on the research literature on gaming and addiction. An expert panel comprising professional raters (n = 7), experiential adolescent raters (n = 10), and parent raters (n = 10) estimated the content validity of each item (I‐CVI) as well as of the whole scale (S‐CVI/Ave), and participated in a cognitive interview about the GAIT scale. The mean scores for both I‐CVI and S‐CVI/Ave ranged between 0.97 and 0.99 compared with the lowest recommended I‐CVI value of 0.78 and the S‐CVI/Ave value of 0.90. There were no sex differences and no differences between expert groups regarding ratings in content validity. No differences in the overall evaluation of the scale emerged in the cognitive interviews. Our conclusions were that GAIT showed good content validity in capturing gaming addiction. The GAIT needs further investigation into its psychometric properties of construct validity (convergent and divergent validity) and criterion‐related validity, as well as its reliability in both clinical settings and in community settings with adolescents.
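
The two content validity indices named in this abstract can be illustrated with a short sketch. It assumes the common convention that each expert rates each item on a 4-point relevance scale, that I-CVI is the proportion of experts giving a rating of 3 or 4, and that S-CVI/Ave is the mean of the I-CVIs. The abstract does not describe the rating scale, so these conventions, the function names, and the example ratings are all assumptions.

```python
def item_cvi(ratings, relevant_from=3):
    """I-CVI: share of experts rating the item as relevant (assumed: 3 or 4 of 4)."""
    return sum(1 for r in ratings if r >= relevant_from) / len(ratings)

def scale_cvi_ave(rating_matrix, relevant_from=3):
    """S-CVI/Ave: mean of the item-level I-CVIs; also returns the I-CVIs."""
    item_scores = [item_cvi(item, relevant_from) for item in rating_matrix]
    return sum(item_scores) / len(item_scores), item_scores

# Hypothetical ratings: one row per item, one column per expert (7 experts).
example_ratings = [
    [4, 4, 3, 4, 3, 4, 4],
    [4, 3, 2, 4, 4, 4, 3],
    [4, 4, 4, 4, 4, 4, 4],
]
s_cvi, i_cvis = scale_cvi_ave(example_ratings)
# Thresholds quoted in the abstract: I-CVI of at least 0.78, S-CVI/Ave of at least 0.90.
items_ok = all(score >= 0.78 for score in i_cvis)
scale_ok = s_cvi >= 0.90
```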

20.
The a_d coefficient was developed to measure the within‐group agreement of ratings. The underlying theory as well as the construction of the coefficient are explained. The a_d coefficient ranges from 0 to 1, regardless of the number of scale points, raters, or items. With some limitations the measure of the within‐group agreement of different groups and groups from different studies is directly comparable. For statistical significance testing, the binomial distribution is introduced as a model of the ratings' random distribution given the true score of a group construct. This method enables a decision about essential agreement and not only about a significant difference from 0 or a chosen critical value. The a_d coefficient identifies a single true score within a group. It is not provided for multiple true score settings. The comparison of the a_d coefficient with other agreement indices shows that the new coefficient is in line with their outcomes, but does not result in infinite or inappropriate values.
