Similar Literature
20 related records found.
1.
Most inter-rater reliability studies using nominal scales suggest the existence of two populations of inference: the population of subjects (the collection of objects or persons to be rated) and the population of raters. Consequently, the sampling variance of the inter-rater reliability coefficient can be seen as the combined effect of sampling subjects and sampling raters. However, all inter-rater reliability variance estimators proposed in the literature account only for subject sampling variability, ignoring the extra sampling variance due to the sampling of raters, even though the latter may be the largest of the variance components. Such variance estimators permit statistical inference only with respect to the subject universe. This paper proposes variance estimators that make it possible to infer to both the universe of subjects and the universe of raters. The consistency of these variance estimators is proved, as well as their validity for confidence interval construction. These results apply only to fully crossed designs in which each rater rates each subject. A small Monte Carlo simulation study demonstrates the accuracy of the large-sample approximations in reasonably small samples.
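A rough way to see the two sources of sampling variance is to compare a bootstrap over subjects only with a two-way bootstrap over subjects and raters in a fully crossed design. The sketch below is only an illustration of that idea, not the analytic variance estimators derived in the paper; the agreement index (mean pairwise percent agreement), the data-generating model, and all sample sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                                   # nominal categories (hypothetical)

def percent_agreement(ratings):
    """Mean pairwise agreement across raters for an (n_subjects, n_raters) array."""
    n, r = ratings.shape
    pairs = [(a, b) for a in range(r) for b in range(a + 1, r)]
    return np.mean([(ratings[:, a] == ratings[:, b]).mean() for a, b in pairs])

def simulate(n_subj=60, n_rater=6):
    """Fully crossed design: every rater rates every subject; raters differ in error rate."""
    true_cat = rng.integers(0, K, n_subj)
    err_rate = rng.uniform(0.1, 0.6, n_rater)            # rater-to-rater heterogeneity
    flip = rng.random((n_subj, n_rater)) < err_rate       # broadcasts over subjects
    return np.where(flip, rng.integers(0, K, (n_subj, n_rater)), true_cat[:, None])

def bootstrap_se(ratings, resample_raters, B=2000):
    """Bootstrap SE of the agreement index, resampling subjects and optionally raters."""
    n, r = ratings.shape
    stats = []
    for _ in range(B):
        rows = rng.integers(0, n, n)
        cols = rng.integers(0, r, r) if resample_raters else np.arange(r)
        stats.append(percent_agreement(ratings[np.ix_(rows, cols)]))
    return np.std(stats, ddof=1)

x = simulate()
print("subjects only:       SE ~", round(bootstrap_se(x, False), 4))
print("subjects and raters: SE ~", round(bootstrap_se(x, True), 4))
```

When raters differ in their error rates, the second standard error is typically noticeably larger, which is the extra rater-sampling variance the abstract refers to.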

2.
The kappa coefficient is one of the most widely used measures for evaluating the agreement between two raters asked to assign N objects to one of K nominal categories. Weighted versions of kappa enable partial credit to be awarded for near agreement, most notably in the case of ordinal categories. An exact significance test for weighted kappa can be conducted by enumerating all rater agreement tables with the same fixed marginal frequencies as the observed table and accumulating the probabilities of all tables that produce a weighted kappa greater than or equal to the observed value. Unfortunately, complete enumeration of all tables is computationally unwieldy even for modest values of N and K. We present an implicit enumeration algorithm for conducting an exact test of weighted kappa that can be applied to tables of non-trivial size. The algorithm is particularly efficient for 'good' to 'excellent' values of weighted kappa, which typically have very small p-values. Our method is therefore beneficial in situations where resampling tests are of limited value because the number of trials needed to estimate the p-value tends to be large.
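For orientation, the sketch below shows the weighted kappa statistic itself and the resampling baseline the abstract contrasts with the exact test: permuting one rater's labels leaves both raters' marginal frequencies fixed. It is not the implicit enumeration algorithm proposed in the paper; the category labels, weight schemes, and trial counts are illustrative.

```python
import numpy as np

def weighted_kappa(a, b, K, weights="quadratic"):
    """Weighted kappa for two raters' labels a, b coded 0..K-1."""
    i, j = np.meshgrid(np.arange(K), np.arange(K), indexing="ij")
    if weights == "quadratic":
        w = 1.0 - (i - j) ** 2 / (K - 1) ** 2    # full credit on the diagonal
    else:                                        # linear weights
        w = 1.0 - np.abs(i - j) / (K - 1)
    obs = np.zeros((K, K))
    for x, y in zip(a, b):
        obs[x, y] += 1
    obs /= len(a)
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    po, pe = (w * obs).sum(), (w * exp).sum()
    return (po - pe) / (1.0 - pe)

def monte_carlo_p(a, b, K, trials=10_000, seed=1):
    """Resampling p-value: permuting b keeps both raters' marginal totals fixed."""
    rng = np.random.default_rng(seed)
    k_obs = weighted_kappa(a, b, K)
    hits = sum(weighted_kappa(a, rng.permutation(b), K) >= k_obs for _ in range(trials))
    return (hits + 1) / (trials + 1)

a = np.array([0, 1, 2, 2, 3, 1, 0, 2])
b = np.array([0, 1, 2, 3, 3, 1, 1, 2])
print(weighted_kappa(a, b, K=4), monte_carlo_p(a, b, K=4, trials=2000))
```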

3.
The rater agreement literature is complicated by the fact that it must accommodate at least two different properties of rating data: the number of raters (two versus more than two) and the rating scale level (nominal versus metric). While kappa statistics are most widely used for nominal scales, intraclass correlation coefficients have been preferred for metric scales. In this paper, we suggest a dispersion-weighted kappa framework for multiple raters that integrates several important agreement statistics by using familiar dispersion indices as weights for expressing disagreement. These weights are applied to the cells of the traditional inter-judge contingency table. Novel agreement statistics can be obtained by applying less familiar indices of dispersion in the same way.

4.
Some Paradoxical Results for the Quadratically Weighted Kappa
The quadratically weighted kappa is the most commonly used weighted kappa statistic for summarizing interrater agreement on an ordinal scale. This paper presents several paradoxical properties of the quadratically weighted kappa. For agreement tables with an odd number of categories n, it is shown that if one of the raters uses the same base rates for categories 1 and n, categories 2 and n−1, and so on, then the value of the quadratically weighted kappa does not depend on the value of the center cell of the agreement table. Since the center cell reflects the two raters' exact agreement on the middle category, this result calls into question the applicability of the quadratically weighted kappa to agreement studies. If a single index of agreement is to be reported for an ordinal scale, it is recommended that the linearly weighted kappa be used instead of the quadratically weighted kappa.
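To make the distinction concrete, a minimal sketch comparing the linearly and quadratically weighted kappa on the same agreement table, using the standard weight definitions w_ij = 1 - |i - j|/(K - 1) and w_ij = 1 - (i - j)^2/(K - 1)^2; the table of counts is invented and the function name is hypothetical.

```python
import numpy as np

def weighted_kappa_from_table(table, weights="quadratic"):
    """Weighted kappa from a K x K contingency table of counts (rater 1 rows, rater 2 columns)."""
    p = table / table.sum()
    K = p.shape[0]
    i, j = np.meshgrid(np.arange(K), np.arange(K), indexing="ij")
    d = (i - j) ** 2 / (K - 1) ** 2 if weights == "quadratic" else np.abs(i - j) / (K - 1)
    w = 1.0 - d
    pe = np.outer(p.sum(axis=1), p.sum(axis=0))           # chance agreement from the margins
    return ((w * p).sum() - (w * pe).sum()) / (1.0 - (w * pe).sum())

t = np.array([[20, 5, 1],
              [4, 30, 6],
              [2, 5, 27]], dtype=float)
print("linear:   ", weighted_kappa_from_table(t, "linear"))
print("quadratic:", weighted_kappa_from_table(t, "quadratic"))
```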

5.

Purpose

The study specified an alternative model to examine the measurement invariance of multisource performance ratings (MSPRs) and to systematically investigate the theoretical meaning of common method variance in the form of rater effects. Rather than testing invariance with a multigroup design in which raters are aggregated within sources, this study specified both performance dimension and idiosyncratic rater factors.

Design/Methodology/Approach

Data were obtained from 5,278 managers from a wide range of organizations and hierarchical levels who were rated on the BENCHMARKS® MSPR instrument.

Findings

Our results diverged from prior research in that MSPRs were found to lack invariance for raters from different levels. However, same-level raters provided equivalent ratings in terms of both the performance dimension loadings and the rater factor loadings.

Implications

The results illustrate the importance of modeling rater factors when investigating invariance and suggest that rater factors reflect substantively meaningful variance, not bias.

Originality/Value

The current study applies an alternative model to examine the invariance of MSPRs, which allowed us to answer three questions that would not be possible with more traditional multigroup designs. First, the model allowed us to examine the impact of parameterizing idiosyncratic rater factors on inferences of cross-rater invariance. Next, including multiple raters from each organizational level in the MSPR model allowed us to tease apart the degree of invariance for raters from the same source relative to raters from different sources. Finally, our study allowed for inferences with respect to the invariance of idiosyncratic rater factors.

6.
Agreement between Two Independent Groups of Raters
We propose a coefficient of agreement to assess the degree of concordance between two independent groups of raters classifying items on a nominal scale. This coefficient, defined on a population-based model, extends the classical Cohen's kappa coefficient for quantifying agreement between two raters. Weighted and intraclass versions of the coefficient are also given, and their sampling variance is determined by the jackknife method. The method is illustrated on the medical education data that motivated the research.
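As a reference point for the coefficient being extended, the sketch below computes the classical two-rater Cohen's kappa together with a jackknife standard error over subjects; it does not reproduce the group-level coefficient or its weighted and intraclass versions, and the example data are hypothetical.

```python
import numpy as np

def cohen_kappa(a, b, K):
    """Cohen's kappa for two raters' labels a, b coded 0..K-1."""
    p = np.zeros((K, K))
    for x, y in zip(a, b):
        p[x, y] += 1
    p /= len(a)
    po = np.trace(p)
    pe = p.sum(axis=1) @ p.sum(axis=0)
    return (po - pe) / (1 - pe)

def jackknife_se(a, b, K):
    """Leave-one-subject-out jackknife standard error of kappa."""
    n = len(a)
    loo = np.array([cohen_kappa(np.delete(a, i), np.delete(b, i), K) for i in range(n)])
    return np.sqrt((n - 1) / n * ((loo - loo.mean()) ** 2).sum())

rng = np.random.default_rng(3)
a = rng.integers(0, 3, 50)
b = np.where(rng.random(50) < 0.7, a, rng.integers(0, 3, 50))
print(cohen_kappa(a, b, 3), "+/-", jackknife_se(a, b, 3))
```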

7.
There is a frequent need to measure the degree of agreement among R observers who independently classify n subjects within K nominal or ordinal categories. The most popular methods are kappa-type measures. When R = 2, Cohen's kappa coefficient (weighted or not) is well known. When defined in the ordinal case with quadratic weights, Cohen's kappa has the advantage of coinciding with the intraclass and concordance correlation coefficients. When R > 2, there are more discrepancies, because the definition of the kappa coefficient depends on how the phrase ‘an agreement has occurred’ is interpreted. In this paper, Hubert's interpretation, that ‘an agreement occurs if and only if all raters agree on the categorization of an object’, is used, which leads to Hubert's (nominal) and Schuster and Smith's (ordinal) kappa coefficients. Formulae for the large-sample variances of the estimators of all these coefficients are given, which allows the different ways of carrying out inference to be illustrated and, with the use of simulation, the optimal procedure to be selected. In addition, it is shown that Schuster and Smith's kappa coefficient coincides with the intraclass and concordance correlation coefficients when it, too, is defined with quadratic weights.
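A minimal sketch of a kappa built on Hubert's interpretation, assuming the usual operationalization: observed agreement is the proportion of subjects on whom all R raters agree, and chance agreement sums, over categories, the product of the raters' marginal proportions. The variance formulae discussed in the paper are not shown, and the data are invented.

```python
import numpy as np

def hubert_kappa(ratings, K):
    """Multi-rater kappa under 'all raters agree'.
    ratings: (n_subjects, R) integer array with categories 0..K-1."""
    n, R = ratings.shape
    po = np.mean([(row == row[0]).all() for row in ratings])          # all-agree proportion
    marg = np.array([np.bincount(ratings[:, r], minlength=K) / n for r in range(R)])
    pe = sum(np.prod(marg[:, k]) for k in range(K))                   # chance of unanimous agreement
    return (po - pe) / (1 - pe)

rng = np.random.default_rng(7)
true = rng.integers(0, 4, 80)
ratings = np.where(rng.random((80, 3)) < 0.75, true[:, None], rng.integers(0, 4, (80, 3)))
print(hubert_kappa(ratings, K=4))
```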

8.
Inter-rater reliability and accuracy are both measures of rater performance. Inter-rater reliability is frequently used as a substitute for accuracy despite conceptual differences and literature suggesting important differences between them. The aims of this study were to compare inter-rater reliability and accuracy among a group of raters using a treatment adherence scale, and to assess factors affecting the reliability of these ratings. Paired undergraduate raters assessed therapist behavior by viewing videotapes of four therapists' cognitive behavioral therapy sessions. Ratings were compared with expert-generated criterion ratings and between raters using the intraclass correlation, ICC(2,1). Inter-rater reliability was marginally higher than accuracy (p = 0.09). The specific therapist significantly affected both inter-rater reliability and accuracy. The frequency and intensity of the therapists' ratable behaviors, as captured in the criterion ratings, correlated only with rater accuracy. Consensus ratings were more accurate than individual ratings, but composite ratings were not more accurate than consensus ratings. In conclusion, accuracy cannot be assumed to exceed inter-rater reliability or vice versa, and both are influenced by multiple factors. In this study, the subject of the ratings (i.e., the therapist and the intensity and frequency of rated behaviors) influenced both inter-rater reliability and accuracy. The additional resources needed for a composite rating, a rating based on the average score of paired raters, may be justified by improved accuracy over individual ratings. The additional time required to arrive at a consensus rating, a rating generated following discussion between two raters, may not be warranted. Further research is needed to determine whether these findings hold with other raters and treatment adherence scales.
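Because both reliability and accuracy here rest on ICC(2,1), a short sketch of that coefficient computed from the two-way ANOVA mean squares (two-way random effects, absolute agreement, single rating, in the Shrout and Fleiss sense); the score matrix is invented.

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rating.
    x is an (n_subjects, k_raters) matrix of scores."""
    n, k = x.shape
    grand = x.mean()
    ms_rows = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)     # subjects
    ms_cols = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)     # raters
    ss_err = ((x - x.mean(axis=1, keepdims=True)
                 - x.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

rng = np.random.default_rng(11)
# 30 taped sessions rated by 2 raters; the second rater is slightly more lenient
scores = rng.normal(5, 1, (30, 1)) + np.array([0.0, 0.4]) + rng.normal(0, 0.5, (30, 2))
print(icc_2_1(scores))
```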

9.
Rater bias in the EASI temperament scales: a twin study
Under trait theory, ratings may be modeled as a function of the temperament of the child and the bias of the rater. Two linear structural equation models are described: one for mutual self- and partner ratings, and one for multiple ratings of related individuals. Application of the first model to EASI temperament data collected from spouses rating each other shows moderate agreement between raters and little rating bias. Spouse pairs agree moderately when rating their twin children, but there is significant rater bias, with greater bias for monozygotic than for dizygotic twins. Maximum-likelihood estimates of heritability are approximately .5 for all temperament scales, with no common environmental variance. Results are discussed with reference to trait validity, the person-situation debate, halo effects, and stereotyping. Questionnaire development using ratings on family members permits increased rater agreement and reduced rater bias.

10.
When analysts evaluate performance assessments, they often use modern measurement theory models to identify raters who frequently give ratings that differ from what would be expected given the quality of the performance. To detect problematic scoring patterns, two rater fit statistics, the infit and outfit mean square error (MSE) statistics, are routinely used. However, the interpretation of these statistics is not straightforward. A common practice is for researchers to apply established rule-of-thumb critical values to interpret infit and outfit MSE statistics. Unfortunately, prior studies have shown that these rule-of-thumb values may not be appropriate in many empirical situations. Parametric bootstrapped critical values for infit and outfit MSE statistics provide a promising alternative approach to identifying item and person misfit in item response theory (IRT) analyses, but researchers have not examined the performance of this approach for detecting rater misfit. In this study, we illustrate a bootstrap procedure that researchers can use to identify critical values for infit and outfit MSE statistics, and we use a simulation study to assess the false-positive and true-positive rates of these two statistics. We observed that the false-positive rates were highly inflated and the true-positive rates were relatively low. We therefore propose an iterative parametric bootstrap procedure to overcome these limitations. The results indicate that using the iterative procedure to establish 95% critical values of the infit and outfit MSE statistics yields better-controlled false-positive rates and higher true-positive rates than the traditional parametric bootstrap procedure and rule-of-thumb critical values.
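The bootstrap recipe described here can be sketched compactly. The version below deliberately swaps in a simple additive rating model and an outfit-style mean square of standardized residuals in place of the IRT rater model and fit statistics used in the study, and it omits the iterative refitting step, so it shows only the general procedure (fit, simulate from the fitted model, take the 95th percentile per rater); every name and parameter is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_additive(x):
    """Least-squares fit of rating_ij = mu + subject_i + rater_j + error
    (a simplified stand-in for the IRT rater model used in the study)."""
    mu = x.mean()
    subj = x.mean(axis=1) - mu
    rater = x.mean(axis=0) - mu
    resid = x - (mu + subj[:, None] + rater[None, :])
    return mu, subj, rater, resid.std(ddof=1)

def outfit_like(x):
    """Outfit-style statistic per rater: mean squared standardized residual."""
    mu, subj, rater, sigma = fit_additive(x)
    z = (x - (mu + subj[:, None] + rater[None, :])) / sigma
    return (z ** 2).mean(axis=0)

def bootstrap_critical(x, B=2000, q=0.95):
    """Parametric bootstrap: simulate from the fitted model, recompute the fit
    statistic, and take its q-quantile as the per-rater critical value."""
    mu, subj, rater, sigma = fit_additive(x)
    fitted = mu + subj[:, None] + rater[None, :]
    stats = np.array([outfit_like(fitted + rng.normal(0, sigma, x.shape))
                      for _ in range(B)])
    return np.quantile(stats, q, axis=0)

x = rng.normal(0, 1, (100, 8)) + rng.normal(0, 0.3, (1, 8))   # 100 examinees, 8 raters
x[:, 0] += rng.normal(0, 1.2, 100)                            # one noisy (misfitting) rater
flags = outfit_like(x) > bootstrap_critical(x, B=500)
print(flags)
```

The iterative variant proposed by the authors would remove the flagged raters, refit, and recompute the critical values until the flagged set stabilizes; that loop is not shown here.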

11.
The ad coefficient was developed to measure the within-group agreement of ratings. The underlying theory and the construction of the coefficient are explained. The ad coefficient ranges from 0 to 1 regardless of the number of scale points, raters, or items. With some limitations, within-group agreement measured in different groups, and in groups from different studies, is directly comparable. For statistical significance testing, the binomial distribution is introduced as a model of the ratings' random distribution given the true score of a group construct. This method enables a decision about essential agreement, not merely about a significant difference from 0 or from a chosen critical value. The ad coefficient identifies a single true score within a group; it is not intended for settings with multiple true scores. A comparison of the ad coefficient with other agreement indices shows that the new coefficient is in line with their outcomes but does not produce infinite or inappropriate values.

12.
The standardization of ADHD ratings in adults is important given adults' differing symptom presentation. The authors investigated the agreement and reliability achieved by rater standardization in a large-scale trial of atomoxetine in adults with ADHD. Training of 91 raters on the investigator-administered ADHD Rating Scale (ADHDRS-IV-Inv) occurred prior to initiation of a large, 31-site atomoxetine trial. Agreement between raters on total scores was established in two ways: (a) by the kappa coefficient (rater agreement for each item, together with the percentage of raters with identical item-by-item scores) and (b) by intraclass correlation coefficients (reliability). For the ADHDRS-IV-Inv, rater agreement was moderate, and reliability, as measured by Cronbach's alpha, was substantial. The data indicate that clinicians can be trained to reliably evaluate ADHD in adults using the ADHDRS-IV-Inv.

13.
This study investigates the effects of rater personality (Conscientiousness and Agreeableness), rating format (graphic rating scale vs. behavioral checklist), and the social context of rating (face-to-face feedback vs. no face-to-face feedback) on the elevation of performance ratings. As predicted, raters high on Agreeableness showed more elevated ratings than those low on Agreeableness when they expected a face-to-face feedback meeting. Furthermore, rating format moderated the relationship between Agreeableness and rating elevation, such that raters high on Agreeableness provided less elevated ratings when using the behavioral checklist than when using the graphic rating scale, whereas raters low on Agreeableness showed little difference in elevation across rating formats. Results also suggest that the interactive effects of rater personality, rating format, and social context may depend on the performance level of the ratee. The implications of these findings are discussed.

14.
Research into expertise is relatively common in cognitive science, and expertise has been documented across many domains. However, much less research has examined how experts within the same domain assess the performance of their peer experts. We report the results of a modified think-aloud study conducted with 18 pilots (6 first officers, 6 captains, and 6 flight examiners). Pairs of same-ranked pilots were asked to rate the performance of a captain flying a critical pre-recorded simulator scenario. Findings reveal (a) considerable variance within performance categories, (b) differences in the process used as evidence in support of a performance rating, (c) different numbers and types of facts (cues) identified, and (d) differences in how specific performance events affect the choice of performance category and the gravity of the performance assessment. Such variance is consistent with low inter-rater reliability. Because raters exhibited good, albeit imprecise, reasons and facts, a fuzzy mathematical model of performance rating was developed. The model provides good agreement with the observed variations.

15.
This study considered the validity of the personality structure based on the Five-Factor Model using both self- and peer reports on twins' NEO-PI-R facets. Separating common from specific genetic variance in self- and peer reports, the study examined the genetic substance of different trait levels and the rater-specific perspectives involved in personality judgments. Data from 919 twin pairs were analyzed using a multiple-rater twin model to disentangle genetic and environmental effects on domain-level trait, facet-specific trait, and rater-specific variance. About two thirds of both the domain-level trait variance and the facet-specific trait variance was attributable to genetic factors. This suggests that the more accurately personality is measured, the better these measures reflect the genetic structure. Specific variance in self- and peer reports also showed modest to substantial genetic influence. This may indicate not only genetically influenced self-rater biases but also substance components specific to self- and peer raters' perspectives on the traits actually measured.

16.
张赟  翁清雄 《心理科学进展》2018,26(6):1131-1140
The use of multisource (multi-rater) appraisal has become increasingly mature in companies abroad, but in China it is still at an exploratory and developmental stage. Drawing on existing research findings, this paper examines the characteristics and underlying mechanisms of multisource appraisal from three angles: the appraisal process, the rating sources, and the ratees. With respect to the process, appraisal purposes are multiple, anonymity is emphasized in how ratings are collected, and appropriate use of the results is critical. With respect to the rating sources, agreement across different sources is low, and halo and leniency effects arise easily. With respect to the ratees, individuals' reactions to multisource feedback are influenced by personality characteristics, the sign of the feedback, and the gap between self- and other-ratings. Research also finds that the performance improvements produced by multisource appraisal are unstable. Accordingly, how to improve the effectiveness and accuracy of the multisource appraisal process, how to improve reactions to the appraisal results, and how to aggregate multisource ratings effectively are important topics for future research.

17.
Organizational research and practice involving ratings are rife with what the authors term ill-structured measurement designs (ISMDs)--designs in which raters and ratees are neither fully crossed nor nested. This article explores the implications of ISMDs for estimating interrater reliability. The authors first provide a mock example that illustrates the potential problems that ISMDs create for common reliability estimators (e.g., Pearson correlations, intraclass correlations). Next, the authors propose an alternative reliability estimator--G(q,k)--that resolves problems with traditional estimators and is equally appropriate for crossed, nested, and ill-structured designs. Using Monte Carlo simulation, the authors evaluate the accuracy of traditional reliability estimators compared with that of G(q,k) for ratings arising from ISMDs. Regardless of condition, G(q,k) yielded estimates as precise as or more precise than those of traditional estimators. The advantage of G(q,k) over the traditional estimators became more pronounced with increases in (a) the overlap between the sets of raters that rated each ratee and (b) the ratio of rater main effect variance to true score variance. Discussion focuses on the implications of this work for organizational research and practice.

18.
赵群  曹亦薇 《应用心理学》2006,12(3):258-263
Portfolio assessment is favored because it can fully serve the functions of promoting student development and improving instruction, but poor measurement reliability and validity have limited its use in educational evaluation. This paper reports an empirical study of the characteristics of rater reliability for portfolio assessment: four raters gave graded scores to 152 portfolios on two occasions, and rater reliability was computed using several statistical methods. The results show that the portfolio scores have relatively high association, moderate-to-weak agreement, and a degree of stability, with the scores for overall portfolio quality showing the highest reliability. In this study, with three raters, both the generalizability coefficient and the dependability (reliability) coefficient for ratings of overall portfolio quality exceeded 0.80.
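For readers unfamiliar with the two coefficients reported above, the sketch below applies the standard single-facet person-by-rater D-study formulas to show how the generalizability coefficient and the dependability coefficient grow with the number of raters; the variance components are invented for illustration and are not those estimated in the study.

```python
# Single-facet p x r design: sigma2_p (portfolios), sigma2_r (raters),
# sigma2_pr_e (interaction plus error). Values below are purely illustrative.
sigma2_p, sigma2_r, sigma2_pr_e = 1.00, 0.15, 0.60

for n_r in (1, 2, 3, 4, 5):
    g_coef = sigma2_p / (sigma2_p + sigma2_pr_e / n_r)               # relative (generalizability)
    phi = sigma2_p / (sigma2_p + (sigma2_r + sigma2_pr_e) / n_r)     # absolute (dependability)
    print(f"raters={n_r}: Erho2={g_coef:.2f}, Phi={phi:.2f}")
```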

19.
This article presents large-sample development and validation research for a set of research scales based on an existing 360-degree personality measure, the LMAP 360 (Leadership Multi-rater Assessment of Personality). In Study 1 (N = 1,771), we identified six broad domains underlying the LMAP item clusters: Neuroticism, Dominance, Enthusiasm, Openness, Agreeableness, and Conscientiousness. Scales measuring these broad domains and their constituent facets showed strong internal consistency, inter-rater reliability, and self-informant correlations. In Study 2 (N = 729 and N = 694), we examined the LMAP research scales' convergent and discriminant validity against three well-validated personality inventories (Goldberg's adjectives, the Big Five Inventory, and the Big Five Aspect Scales) and one measure of cognitive ability (the International Cognitive Ability Resource). The LMAP research scales correlated strongly with corresponding scales from the other inventories and were distinct from cognitive ability.

20.
Interrater correlations are widely interpreted as estimates of the reliability of supervisory performance ratings and are frequently used to correct the correlations between ratings and other measures (e.g., test scores) for attenuation. These interrater correlations provide some useful information, but they are not reliability coefficients. There is clear evidence of systematic rater effects in performance appraisal, and variance associated with raters is not a source of random measurement error. We use generalizability theory to show why rater variance is not properly interpreted as measurement error and how such systematic rater effects can influence both reliability estimates and validity coefficients. We show conditions under which interrater correlations can either overestimate or underestimate reliability coefficients, and we discuss reasons other than random measurement error for low interrater correlations.
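To make the argument tangible, the sketch below estimates person, rater, and residual variance components from a crossed person-by-rater design via the expected mean squares, and then contrasts the quantity an interrater correlation tends to track (which ignores the rater main effect) with an absolute-agreement coefficient that counts rater variance as error; the data and the size of the rater effect are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 200, 4                                          # persons, raters (hypothetical)
person = rng.normal(0, 1.0, (n, 1))                    # true-score variance  = 1.00
rater = rng.normal(0, 0.7, (1, k))                     # rater main effect    = 0.49 (systematic)
x = person + rater + rng.normal(0, 0.8, (n, k))        # residual variance    = 0.64

grand = x.mean()
ms_p = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
ms_r = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)
ms_pr = ((x - x.mean(1, keepdims=True) - x.mean(0, keepdims=True) + grand) ** 2).sum() / ((n - 1) * (k - 1))

s2_pr_e = ms_pr                                        # interaction + error
s2_p = (ms_p - ms_pr) / k                              # person (true-score) component
s2_r = (ms_r - ms_pr) / n                              # rater main-effect component

corr_like = s2_p / (s2_p + s2_pr_e)                    # what a mean interrater r estimates
absolute = s2_p / (s2_p + s2_r + s2_pr_e)              # rater variance counted as error
print(round(corr_like, 2), round(absolute, 2), "mean interrater r:",
      round(np.corrcoef(x.T)[np.triu_indices(k, 1)].mean(), 2))
```

With a sizable rater main effect, the mean interrater correlation sits well above the absolute-agreement coefficient, which is one concrete way an interrater correlation can overstate reliability for absolute decisions.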
