Similar Articles
20 similar articles found.
1.
Agreement between Two Independent Groups of Raters
We propose a coefficient of agreement to assess the degree of concordance between two independent groups of raters classifying items on a nominal scale. This coefficient, defined on a population-based model, extends the classical Cohen’s kappa coefficient for quantifying agreement between two raters. Weighted and intraclass versions of the coefficient are also given and their sampling variance is determined by the Jackknife method. The method is illustrated on medical education data which motivated the research.
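For orientation, the two-rater Cohen's kappa that this coefficient generalizes can be computed directly from the K x K cross-classification of the two raters' assignments. The sketch below shows only that baseline; the paper's group-level coefficient, its weighted and intraclass versions, and the jackknife variance are not reproduced here, and the example table is hypothetical.

```python
import numpy as np

def cohen_kappa(table: np.ndarray) -> float:
    """Cohen's kappa for a K x K cross-classification of two raters.

    table[i, j] = number of items rater A placed in category i
                  and rater B placed in category j.
    """
    p = table / table.sum()              # joint proportions
    po = np.trace(p)                     # observed agreement
    pe = p.sum(axis=1) @ p.sum(axis=0)   # chance agreement from the margins
    return (po - pe) / (1 - pe)

# Hypothetical 3-category example: two raters, 100 items.
table = np.array([[30,  5,  2],
                  [ 4, 25,  6],
                  [ 1,  7, 20]])
print(round(cohen_kappa(table), 3))
```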

2.
Some Paradoxical Results for the Quadratically Weighted Kappa
The quadratically weighted kappa is the most commonly used weighted kappa statistic for summarizing interrater agreement on an ordinal scale. The paper presents several properties of the quadratically weighted kappa that are paradoxical. For agreement tables with an odd number of categories n, it is shown that if one of the raters uses the same base rates for categories 1 and n, categories 2 and n−1, and so on, then the value of the quadratically weighted kappa does not depend on the value of the center cell of the agreement table. Since the center cell reflects the exact agreement of the two raters on the middle category, this result calls into question the applicability of the quadratically weighted kappa to agreement studies. If a single index of agreement is to be reported for an ordinal scale, it is recommended that the linearly weighted kappa be used instead of the quadratically weighted kappa.
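To make the linear-versus-quadratic contrast concrete, both statistics can be computed from the same agreement table by switching the weight matrix: w_ij = 1 − |i − j|/(n − 1) for linear weights and w_ij = 1 − (i − j)²/(n − 1)² for quadratic weights. The sketch below is a generic implementation of these two textbook definitions, not code from the paper, and the example table is hypothetical.

```python
import numpy as np

def weighted_kappa(table: np.ndarray, kind: str = "linear") -> float:
    """Weighted kappa for an n x n agreement table on an ordinal scale."""
    n = table.shape[0]
    i, j = np.indices((n, n))
    if kind == "linear":
        w = 1.0 - np.abs(i - j) / (n - 1)
    elif kind == "quadratic":
        w = 1.0 - (i - j) ** 2 / (n - 1) ** 2
    else:
        raise ValueError("kind must be 'linear' or 'quadratic'")
    p = table / table.sum()
    po = (w * p).sum()                                       # weighted observed agreement
    pe = (w * np.outer(p.sum(axis=1), p.sum(axis=0))).sum()  # weighted chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical 5-category ordinal agreement table.
t = np.array([[11, 3, 1, 0, 0],
              [ 4, 9, 4, 1, 0],
              [ 1, 4, 8, 4, 1],
              [ 0, 1, 4, 9, 4],
              [ 0, 0, 1, 3, 12]])
print(round(weighted_kappa(t, "linear"), 3), round(weighted_kappa(t, "quadratic"), 3))
```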

3.
An algorithm and associated FORTRAN program are provided for the exact variance of weighted kappa. Program VARKAP provides the weighted kappa test statistic, the exact variance of weighted kappa, a Z score, one-sided lower- and upper-tail N(0,1) probability values, and the two-tail N(0,1) probability value.
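Program VARKAP's exact-variance algorithm is not reproduced here. As a hedged stand-in, the commonly used large-sample standard error of weighted kappa under H0: kappa_w = 0 yields the same kind of output (kappa_w, standard error, Z score, one- and two-tailed N(0,1) probability values). The variance formula used below is the standard asymptotic one rather than the exact variance the program computes, and the example table is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def weighted_kappa_ztest(table: np.ndarray, w: np.ndarray):
    """Weighted kappa, its large-sample standard error under H0: kappa_w = 0,
    the Z score, and one- and two-tailed N(0,1) probability values."""
    N = table.sum()
    p = table / N
    r, c = p.sum(axis=1), p.sum(axis=0)         # marginal proportions
    po = (w * p).sum()
    pe = (w * np.outer(r, c)).sum()
    kappa_w = (po - pe) / (1 - pe)
    wi = w @ c                                   # row-wise margin-weighted means of w
    wj = r @ w                                   # column-wise margin-weighted means of w
    var0 = ((np.outer(r, c) * (w - np.add.outer(wi, wj)) ** 2).sum() - pe ** 2) / (N * (1 - pe) ** 2)
    se0 = np.sqrt(var0)
    z = kappa_w / se0
    return kappa_w, se0, z, norm.sf(z), norm.cdf(z), 2 * norm.sf(abs(z))

# Hypothetical 4-category table with linear disagreement weights.
t = np.array([[14,  4,  1,  0],
              [ 3, 12,  4,  1],
              [ 1,  3, 11,  4],
              [ 0,  1,  3, 13]])
n = t.shape[0]
i, j = np.indices((n, n))
w = 1.0 - np.abs(i - j) / (n - 1)
kw, se, z, p_upper, p_lower, p_two = weighted_kappa_ztest(t, w)
print(f"kappa_w={kw:.3f}, SE0={se:.3f}, Z={z:.2f}, two-tailed p={p_two:.4f}")
```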

4.
Resampling probability values for weighted kappa with multiple raters
A new procedure to compute weighted kappa with multiple raters is described. A resampling procedure to compute approximate probability values for weighted kappa with multiple raters is presented. Applications of weighted kappa are illustrated with an example analysis of classifications by three independent raters.
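The paper's particular multi-rater weighted kappa is not shown here. As a hedged illustration of the resampling idea, the sketch below uses a simple stand-in statistic, the mean of all pairwise linearly weighted kappas, and approximates a chance-agreement reference distribution by independently permuting each rater's ratings. All names and data are hypothetical.

```python
import numpy as np

def pairwise_weighted_kappa(x, y, n_cat):
    """Linearly weighted kappa between two raters' ordinal ratings coded 0..n_cat-1."""
    table = np.zeros((n_cat, n_cat))
    np.add.at(table, (x, y), 1)
    i, j = np.indices((n_cat, n_cat))
    w = 1.0 - np.abs(i - j) / (n_cat - 1)
    p = table / table.sum()
    po = (w * p).sum()
    pe = (w * np.outer(p.sum(axis=1), p.sum(axis=0))).sum()
    return (po - pe) / (1 - pe)

def mean_pairwise_kappa(ratings, n_cat):
    """Average linearly weighted kappa over all rater pairs (ratings: raters x subjects)."""
    m = ratings.shape[0]
    pairs = [(a, b) for a in range(m) for b in range(a + 1, m)]
    return np.mean([pairwise_weighted_kappa(ratings[a], ratings[b], n_cat) for a, b in pairs])

def resampled_p_value(ratings, n_cat, n_resamples=2000, seed=0):
    """Approximate P(statistic >= observed) by independently permuting each rater's ratings."""
    rng = np.random.default_rng(seed)
    observed = mean_pairwise_kappa(ratings, n_cat)
    count = 0
    for _ in range(n_resamples):
        shuffled = np.array([rng.permutation(row) for row in ratings])
        if mean_pairwise_kappa(shuffled, n_cat) >= observed:
            count += 1
    return observed, (count + 1) / (n_resamples + 1)

# Hypothetical data: 3 raters classify 12 subjects on a 4-point ordinal scale.
ratings = np.array([[0, 1, 2, 3, 1, 2, 0, 3, 2, 1, 0, 3],
                    [0, 1, 2, 3, 1, 1, 0, 3, 2, 2, 0, 3],
                    [1, 1, 2, 2, 1, 2, 0, 3, 3, 1, 0, 3]])
print(resampled_p_value(ratings, n_cat=4))
```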

5.
The kappa coefficient is one of the most widely used measures for evaluating the agreement between two raters asked to assign N objects to one of K nominal categories. Weighted versions of kappa enable partial credit to be awarded for near agreement, most notably in the case of ordinal categories. An exact significance test for weighted kappa can be conducted by enumerating all rater agreement tables with the same fixed marginal frequencies as the observed table, and accumulating the probabilities for all tables that produce a weighted kappa index that is greater than or equal to the observed measure. Unfortunately, complete enumeration of all tables is computationally unwieldy for modest values of N and K. We present an implicit enumeration algorithm for conducting an exact test of weighted kappa, which can be applied to tables of non‐trivial size. The algorithm is particularly efficient for ‘good’ to ‘excellent’ values of weighted kappa that typically have very small p‐values. Therefore, our method is beneficial for situations where resampling tests are of limited value because the number of trials needed to estimate the p‐value tends to be large.
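The implicit enumeration algorithm itself is not reproduced here. For very small tables, the brute-force version it improves on can be sketched: enumerate every table with the observed row and column totals, compute its probability under the fixed-margin model, and accumulate the probabilities of tables whose weighted kappa is at least the observed value. The table and weights below are hypothetical.

```python
from math import prod, factorial
import numpy as np

def weighted_kappa(table, w):
    """Weighted kappa for an agreement table with weight matrix w."""
    p = table / table.sum()
    po = (w * p).sum()
    pe = (w * np.outer(p.sum(axis=1), p.sum(axis=0))).sum()
    return (po - pe) / (1 - pe)

def tables_with_margins(rows, cols):
    """Yield every non-negative integer table with the given row and column sums."""
    def fill(r_idx, remaining):
        if r_idx == len(rows) - 1:
            yield [list(remaining)]          # last row is forced by the column remainders
            return
        def split(c_idx, left, cells):
            if c_idx == len(cols) - 1:
                if left <= remaining[c_idx]:
                    yield cells + [left]
                return
            for v in range(min(left, remaining[c_idx]) + 1):
                yield from split(c_idx + 1, left - v, cells + [v])
        for row in split(0, rows[r_idx], []):
            rest = [remaining[j] - row[j] for j in range(len(cols))]
            for tail in fill(r_idx + 1, rest):
                yield [row] + tail
    yield from fill(0, list(cols))

def exact_p_value(observed, w):
    """Accumulate P(table) over all fixed-margin tables with kappa_w >= observed kappa_w."""
    rows = [int(x) for x in observed.sum(axis=1)]
    cols = [int(x) for x in observed.sum(axis=0)]
    N = sum(rows)
    k_obs = weighted_kappa(observed.astype(float), w)
    const = prod(factorial(r) for r in rows) * prod(factorial(c) for c in cols) / factorial(N)
    p_val = 0.0
    for t in tables_with_margins(rows, cols):
        t = np.array(t, dtype=float)
        if weighted_kappa(t, w) >= k_obs - 1e-12:
            p_val += const / prod(factorial(int(c)) for c in t.ravel())
    return p_val

# Hypothetical 3 x 3 ordinal table with linear weights; complete enumeration is
# only feasible for small N and K, which is exactly the limitation the paper's
# implicit enumeration algorithm addresses.
obs = np.array([[5, 1, 0],
                [1, 4, 1],
                [0, 1, 5]])
i, j = np.indices((3, 3))
w = 1.0 - np.abs(i - j) / 2
print(round(exact_p_value(obs, w), 6))
```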

6.
Pi (π) and kappa (κ) statistics are widely used in the areas of psychiatry and psychological testing to compute the extent of agreement between raters on nominally scaled data. It is a fact that these coefficients occasionally yield unexpected results in situations known as the paradoxes of kappa. This paper explores the origin of these limitations, and introduces an alternative and more stable agreement coefficient referred to as the AC1 coefficient. Also proposed are new variance estimators for the multiple‐rater generalized π and AC1 statistics, whose validity does not depend upon the hypothesis of independence between raters. This is an improvement over existing alternative variances, which depend on the independence assumption. A Monte‐Carlo simulation study demonstrates the validity of these variance estimators for confidence interval construction, and confirms the value of AC1 as an improved alternative to existing inter‐rater reliability statistics.
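For the two-rater nominal case, Gwet's AC1 replaces kappa's margin-based chance-agreement term with one built from average category propensities, which is what stabilizes it in the paradox situations described above. A minimal two-rater sketch follows; the multiple-rater generalization and the paper's new variance estimators are not shown, and the example table is hypothetical.

```python
import numpy as np

def gwet_ac1(table: np.ndarray) -> float:
    """Gwet's AC1 agreement coefficient for two raters on a K-category nominal scale."""
    K = table.shape[0]
    p = table / table.sum()
    pa = np.trace(p)                                  # observed agreement
    pi = (p.sum(axis=1) + p.sum(axis=0)) / 2          # average category propensities
    pe = (pi * (1 - pi)).sum() / (K - 1)              # AC1-style chance agreement
    return (pa - pe) / (1 - pe)

# Hypothetical table illustrating a "kappa paradox": high raw agreement (0.90) with
# very skewed margins, where Cohen's kappa is near zero but AC1 stays high.
t = np.array([[45, 3],
              [ 2, 0]])
print(round(gwet_ac1(t), 3))
```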

7.
The rater agreement literature is complicated by the fact that it must accommodate at least two different properties of rating data: the number of raters (two versus more than two) and the rating scale level (nominal versus metric). While kappa statistics are most widely used for nominal scales, intraclass correlation coefficients have been preferred for metric scales. In this paper, we suggest a dispersion-weighted kappa framework for multiple raters that integrates some important agreement statistics by using familiar dispersion indices as weights for expressing disagreement. These weights are applied to ratings identifying cells in the traditional inter-judge contingency table. Novel agreement statistics can be obtained by applying less familiar indices of dispersion in the same way.

8.
Most inter-rater reliability studies using nominal scales suggest the existence of two populations of inference: the population of subjects (the collection of objects or persons to be rated) and that of raters. Consequently, the sampling variance of the inter-rater reliability coefficient can be seen as the result of the combined effect of the sampling of subjects and raters. However, all inter-rater reliability variance estimators proposed in the literature account only for the subject sampling variability, ignoring the extra sampling variance due to the sampling of raters, even though the latter may be the largest of the variance components. Such variance estimators make statistical inference possible only to the subject universe. This paper proposes variance estimators that make it possible to infer to both universes of subjects and raters. The consistency of these variance estimators is proved, as well as their validity for confidence interval construction. These results are applicable only to fully crossed designs where each rater must rate each subject. A small Monte Carlo simulation study is presented to demonstrate the accuracy of large-sample approximations on reasonably small samples.

9.
The authors modeled sources of error variance in job specification ratings collected from 3 levels of raters across 5 organizations (N=381). Variance components models were used to estimate the variance in ratings attributable to true score (variance between knowledge, skills, abilities, and other characteristics [KSAOs]) and error (KSAO-by-rater and residual variance). Subsequent models partitioned error variance into components related to the organization, position level, and demographic characteristics of the raters. Analyses revealed that the differential ordering of KSAOs by raters was not a function of these characteristics but rather was due to unexplained rating differences among the raters. The implications of these results for job specification and validity transportability are discussed.

10.
The consistency and loci of leniency, halo, and range restriction effects in performance ratings were investigated in a longitudinal study. Ratings were provided by approximately 90 supervisors in a metropolitan police department, who rated approximately 350 police-rank subordinates on five occasions over a three-and-one-half-year period. Rating effects were computed separately as rater- and ratee-based statistics and intercorrelated among the five rating periods. The nature of the data set made it possible to hold either raters or ratees constant for each analysis, thus permitting inferences regarding whether the sources of reliable variance in effects were due to raters or ratees. It was concluded that reliable variance in mean ratings is partly attributable to ratees, but mainly introduced by raters. Reliable halo variance is attributable to raters, and range restriction is a product of stable group performance variability within intact ratee groups. Implications of these results for future rating process research are discussed.

11.
In an attempt to discover the facial action units for affective states that occur during complex learning, this study adopted an emote-aloud procedure in which participants were recorded as they verbalised their affective states while interacting with an intelligent tutoring system (AutoTutor). Participants’ facial expressions were coded by two expert raters using Ekman's Facial Action Coding System and analysed using association rule mining techniques. The two expert raters achieved overall kappa values between .76 and .84. The association rule mining analysis uncovered facial actions associated with confusion, frustration, and boredom. We discuss these rules and the prospects of enhancing AutoTutor with non-intrusive affect-sensitive capabilities.

12.
Missing data are ubiquitous in psychological surveys and experiments, and they create a series of problems for estimating the variance components of unbalanced data under generalizability theory. Within the generalizability theory framework, a program written in Matlab 7.0 was used to simulate missing data for a random two-facet crossed p×i×r design, and the formula method, the REML method, the splitting method, and the MCMC method were compared in terms of how well each estimates the individual variance components. The results show that (1) the MCMC method outperforms the other three methods in estimating the variance components of missing data under the random two-facet crossed p×i×r design, and (2) items and raters are important factors affecting the estimation of variance components with missing data.

13.
This report examines two methodologic concerns pertaining to use of the cloze procedure in studying the predictability of schizophrenic speech: scoring criteria and raters' education (at or below college level). We find that two strategies for scoring the predictions of raters, one requiring the exact word and the other permitting a reasonable synonym, do not appear to differ in distinguishing groups of patients. The accuracy of raters' guessing is, however, correlated with raters' education: the more educated the rater, the more accurate the guessing. Thought-disordered schizophrenic speech is significantly less predictable than that of non-thought-disordered schizophrenics and normal controls when scored by less educated raters. These differences diminish when more highly educated raters are used. We conclude that raters' education can influence the sensitivity of cloze analysis.

14.
This study applied generalizability theory (GT) and the many-facet Rasch model to structured interview data and offers several recommendations. Using data from a counselor recruitment interview, GT was first used at the macro level to quantify the overall error contributed by applicants, interviewers, and items; on this basis, the many-facet Rasch model was used at the micro level to further examine interviewer severity, differences in applicant ability, item difficulty, and facet bias. The results show that (1) the GT analysis indicates that applicants account for a large share of the variance (90.65%), suggesting that the interview is fairly reliable, with reliability already adequate when there are two interviewers; and (2) the many-facet Rasch model identifies misfitting elements within each facet and bias in the interaction effects, indicating that interview error stems mainly from differences in severity between interviewers and from instability in their internal consistency. Combining GT with the many-facet Rasch model to analyze interview data not only uncovers problematic factors in every aspect of the evaluation process but also provides a better overall picture.

15.
Feasibility, Reliability, and Cost-Efficiency of Recorded Scoring in the Putonghua Proficiency Test
Drawing on generalizability theory from psychometrics, this research used two studies to examine the feasibility of scoring recorded performances in the Putonghua Proficiency Test administered by the State Language Commission, and to explore its reliability, cost-efficiency, and other psychometric properties. The studies involved 25 examinees and 8 raters. The results show that recorded scoring and live scoring yield consistent outcomes, distinguishing at least 90% of the ability differences. The research also indicates that the numbers of raters and items in the current test are already sufficient, although some adjustments could still be made according to examinee ability characteristics and other factors to improve testing efficiency.

16.
When the raters participating in a reliability study are a random sample from a larger population of raters, inferences about the intraclass correlation coefficient must be based on the three mean squares from the analysis of variance table summarizing the results: between subjects, between raters, and error. An approximate confidence interval for the parameter is presented as a function of these three mean squares.
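The point estimate that accompanies these three mean squares is the familiar single-rater intraclass correlation for a two-way random-effects design; a minimal sketch is given below for n subjects rated by k randomly sampled raters. The paper's approximate confidence interval is not reproduced, and the mean-square values in the example are hypothetical.

```python
def icc_two_way_random(msb: float, msr: float, mse: float, n: int, k: int) -> float:
    """Single-rater intraclass correlation for a two-way random-effects design
    (raters sampled from a larger population), from the three ANOVA mean squares:
      msb: between subjects, msr: between raters, mse: residual (error)."""
    return (msb - mse) / (msb + (k - 1) * mse + k * (msr - mse) / n)

# Hypothetical mean squares for n = 30 subjects rated by k = 4 raters.
print(round(icc_two_way_random(msb=11.2, msr=6.3, mse=1.8, n=30, k=4), 3))
```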

17.
黎光明  蒋欢 《心理科学》2019,(3):731-738
Tests that include a rater facet usually do not conform to any standard generalizability theory design, so from the perspective of generalizability theory the data from such tests should be treated as missing data, and it is the test's scoring scheme that determines the missingness structure. Data under three scoring schemes were simulated with R, and the traditional method, the evaluation method, and the splitting method were compared in terms of their estimation accuracy under each scheme. The results show that (1) the traditional method estimates poorly; (2) when rater consistency is high, the evaluation method is appropriate; and (3) the splitting method yields the most accurate estimates, with the caveat that under the fixed-rater scoring scheme attention must be paid to the ratio of raters to examinees: estimates are fairly accurate when this ratio is at most 0.0047.

18.
This study developed and evaluated the inter-rater reliability of a semistructured interview for psychological autopsy (SIPA). The SIPA is composed of 69 items that are distributed into four modules (precipitators and stressors, motivation, lethality, intentionality). The interviews of 42 subjects, related to 21 cases of suicide, were audiotaped and then transcribed and evaluated by the interviewer, and also evaluated by a research assistant and two referees who all acted independently. The SIPA was able to provide information that demonstrated a high degree of concordance (kappa) among the raters. The results of this study demonstrate that the SIPA is a very reliable instrument for psychological autopsy in cases of suicide.

19.
Becoming a fluent reader has been established as important to reading comprehension. Prosody (expression) is an indicator of fluent reading that is linked to improved comprehension in students across elementary, middle, and secondary grades. Fluent reading is most often evaluated by classroom teachers through the use of a rubric, the most common being the Multi-Dimensional Fluency Scale (MDFS) and the National Assessment of Educational Progress (NAEP) scale. This investigation uses a generalizability study (G-study) and a decision study (D-study) to determine the reliability and efficiency of the two rubrics across five raters and two rating occasions in 177 first- through third-grade students. The results revealed the MDFS and NAEP to be parallel instruments, with variance attributable to raters ranging from nearly 0 to 2.2%. Generalizability coefficients ranging from 0.91 to 0.94 were found for both instruments, indicating high reliability. Recommendations for administration efficiency of each rubric are provided and instructional implications are discussed.
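As a hedged illustration of the G-study/D-study logic (not the authors' analysis), a generalizability coefficient for relative decisions in a persons x raters x occasions design can be computed from estimated variance components, with the D-study step showing how projected reliability changes as the numbers of raters and rating occasions are varied. The variance components below are hypothetical.

```python
def g_coefficient(var_p, var_pr, var_po, var_pro_e, n_r, n_o):
    """Generalizability coefficient (relative decisions) for a persons x raters x
    occasions design, given estimated variance components and D-study sample sizes."""
    rel_error = var_pr / n_r + var_po / n_o + var_pro_e / (n_r * n_o)
    return var_p / (var_p + rel_error)

# Hypothetical variance components; the D-study varies raters and occasions.
components = dict(var_p=0.80, var_pr=0.05, var_po=0.03, var_pro_e=0.12)
for n_r in (1, 2, 5):
    for n_o in (1, 2):
        print(n_r, n_o, round(g_coefficient(**components, n_r=n_r, n_o=n_o), 3))
```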

20.