Similar articles
20 similar articles found (search time: 41 ms)
1.
2.
The rater agreement literature is complicated by the fact that it must accommodate at least two different properties of rating data: the number of raters (two versus more than two) and the rating scale level (nominal versus metric). While kappa statistics are most widely used for nominal scales, intraclass correlation coefficients have been preferred for metric scales. In this paper, we suggest a dispersion-weighted kappa framework for multiple raters that integrates some important agreement statistics by using familiar dispersion indices as weights for expressing disagreement. These weights are applied to ratings identifying cells in the traditional inter-judge contingency table. Novel agreement statistics can be obtained by applying less familiar indices of dispersion in the same way. This revised article was published online in August 2005 with the PDF paginated correctly.

3.
黎光明  蒋欢 《心理科学》2019,(3):731-738
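Tests that include a rater facet usually do not conform to any single generalizability theory design, so from the perspective of generalizability theory the data from such tests should be treated as missing data, and the structure of the missingness is determined by the test's rating plan. Data under three rating plans were simulated in R, and the estimation performance of the traditional method (传统法), the evaluation method (评价法), and the splitting method (拆分法) was compared under each plan. The results showed that: (1) the traditional method estimated relatively poorly; (2) the evaluation method is suitable when rater agreement is high; (3) the splitting method gave the most accurate estimates; only under the fixed-rater rating plan does the ratio of raters to examinees need attention, and estimates are fairly accurate when this ratio is less than or equal to 0.0047.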

4.
When providing performance ratings, it is commonly assumed that raters agree more on rating items that are behaviorally based and observable than on items that are vague and less behaviorally based. This study empirically investigated the relationships between agreement among raters, raters' perceptions regarding their difficulty in providing ratings, and expert assessments of the behavioral observability of each item. The results, based on 611 raters in two studies conducted in different locations, suggest that contrary to common expectations, rater agreement can increase as raters' reported rating difficulty increases and as behavioral observability decreases. Explanations and implications are discussed.

5.
Inter-rater reliability and accuracy are measures of rater performance. Inter-rater reliability is frequently used as a substitute for accuracy despite conceptual differences and literature suggesting important differences between them. The aims of this study were to compare inter-rater reliability and accuracy among a group of raters, using a treatment adherence scale, and to assess for factors affecting the reliability of these ratings. Paired undergraduate raters assessed therapist behavior by viewing videotapes of 4 therapists' cognitive behavioral therapy sessions. Ratings were compared with expert-generated criterion ratings and between raters using intraclass correlation (2,1). Inter-rater reliability was marginally higher than accuracy (p = 0.09). The specific therapist significantly affected inter-rater reliability and accuracy. The frequency and intensity of the therapists' ratable behaviors of criterion ratings correlated only with rater accuracy. Consensus ratings were more accurate than individual ratings, but composite ratings were not more accurate than consensus ratings. In conclusion, accuracy cannot be assumed to exceed inter-rater reliability or vice versa, and both are influenced by multiple factors. In this study, the subject of the ratings (i.e. the therapist and the intensity and frequency of rated behaviors) was shown to influence inter-rater reliability and accuracy. The additional resources needed for a composite rating, a rating based on the average score of paired raters, may be justified by improved accuracy over individual ratings. The additional time required to arrive at a consensus rating, a rating generated following discussion between 2 raters, may not be warranted. Further research is needed to determine whether these findings hold true with other raters and treatment adherence scales.
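The abstract above reports intraclass correlation (2,1) without spelling it out. As a rough illustration only, the sketch below computes ICC(2,1) in the usual two-way random-effects, single-rater, absolute-agreement form (Shrout and Fleiss), assuming a complete targets-by-raters table. The function name `icc_2_1` and the small data table are illustrative assumptions, not the study's data or code.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a complete table: one row per target, one column per rater."""
    n = len(ratings)        # number of rated targets
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (targets, raters, residual).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                 # between-targets mean square
    jms = ss_cols / (k - 1)                 # between-raters mean square
    ems = ss_err / ((n - 1) * (k - 1))      # residual mean square

    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


# Illustrative table: 6 targets rated by 4 raters (not data from the study).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(example), 2))  # about 0.29
```

Because the rater mean square enters the denominator, systematic differences in rater severity lower ICC(2,1), which is why it is read as an agreement index and not only a consistency index.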

6.
We investigated whether individualistic and collectivistic cultural values influence the extent to which raters consider task, citizenship (OCB), and counterproductive work behaviors (CWB) when evaluating overall employee performance. Participants completed a managerial role-play exercise in which they read employee performance vignettes and rated the overall performance of each employee. A relative weights approach was used to determine to what extent raters considered task, OCB, and CWB information when evaluating employee performance. Results indicated that as compared to individualistic raters, collectivistic raters placed higher weights on OCBs and less weight on task performance when assigning an overall performance rating. However, contrary to expectations, collectivistic raters did not place significantly higher weights on CWBs as compared to individualistic raters. Future research directions and practical implications are discussed.

8.
张赟  翁清雄 《心理科学进展》2018,26(6):1131-1140
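The use of multisource assessment (多源评价) has become increasingly mature in enterprises abroad, but in China it remains at the stage of exploration and development. Drawing on existing research findings, this review discusses and analyzes the characteristics and underlying mechanisms of multisource assessment from three angles: the assessment process, the rating sources, and the ratees. In terms of the assessment process, the purposes of assessment are multiple, the format emphasizes anonymity, and the appropriate use of assessment results is very important. In terms of rating sources, agreement between different sources is low, and halo and leniency effects arise easily. In terms of ratees, individuals' reactions to multisource assessment results are influenced by factors such as personality traits, feedback signals, and the discrepancy between self-ratings and others' ratings. Research has also found that the performance improvement brought about by multisource assessment is unstable. Accordingly, how to improve the effectiveness and accuracy of the multisource assessment process, how to improve raters' reactions to assessment results, and how to aggregate multisource assessment results effectively are important topics for future research.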

9.
In a sample of 240 college students intersibling agreement was examined for Goldberg's 100 unipolar Big Five adjective markers. Participants showed self-enhancement by rating themselves more favorably on three of the five traits (Agreeableness, Conscientiousness, and Culture/Intellect); however, self-ratings on Neuroticism were higher than siblings' ratings. Correlations among raters were moderate (mean r = .41) and comparable to values obtained in studies using peer ratings. The type of the sibling relationship, based on ratings of relationship quality, moderated the rank-order measures but not the mean agreement.

10.
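This study examined the parameter recovery and applicability of the graded response multilevel facets model (GR-MLFM) proposed by 康春花, 孙小坚, and 曾平飞 (2016) when predictors at both the examinee level and the rater level are included (the full model). The results showed that: (1) the full GR-MLFM is logically and mathematically sound, can be used in the scoring of subjective (constructed-response) items, and detects rater effects, their influencing factors, and the magnitude of those influences well; (2) in the scoring of mathematical problem solving, raters showed two types of rating tendency (leniency and severity effects), but for the great majority of raters the degree of leniency or severity was not pronounced. Raters' conscientiousness positively predicted their severity and their self-confidence positively predicted their leniency, whereas emotional stability and rating experience were not significant predictors.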

11.
Using samples from Iowa OASI sub-populations, the relationships between NMZ scores and the criteria of acceptance for services by the Iowa DVR as well as future rehabilitation of clients were examined. The results suggest that (a) the relationship of NMZ scores to the acceptance criterion is low; (b) the relationship of NMZ scores to the more ultimate criterion of client rehabilitation is moderate, and, here, empirical weights appear to work better than the intuitive weights assigned in the original development of NMZ scores; (c) through empirical weighting, the variable of age was revealed as most important in predicting such criteria.

12.
Little research has been done on the effects of peer raters’ quality characteristics on peer rating qualities. This study aims to address this gap and investigate the effects of key variables related to peer raters’ qualities, including content knowledge, previous rating experience, training on rating tasks, and rating motivation. In an experiment where training and motivation interventions were manipulated, 24 classes with 838 high school students were randomly assigned to study conditions. Inter-rater error, intra-rater error and criterion error indices for peer ratings on four selected essays were analyzed using hierarchical linear models. Results indicated that peer raters’ content knowledge, previous rating experience, and rating motivation were associated with rating errors. This study also found some significant interactions between peer raters’ quality characteristics. Implications for in-person and online peer assessments as well as future directions are discussed.

13.
This study investigated the following hypothesis: physiological, psychological, and verbal behavior indices of communication apprehension can predict comprehension, perception of speaker credibility, and ratings of speech effectiveness. The stimulus materials were videotapes of the first minute of 85 different students expressing their views on women's liberation. Measurement on all the indices of communication apprehension had been taken on these students as the videotapes were being prepared. Each of these one-minute videotapes was shown to a single receiver who then filled out forms measuring comprehension, perception of source credibility, and rating of speech effectiveness. Results supported the hypothesis that the indices of communication apprehension could predict all the communication effects save one, perception of character. The strongest relationship between the set of communication apprehension variables and the set of communication effectiveness variables indicated that individuals who reported high apprehension experienced much silence in their speech and received low ratings on language facility, vocal characteristics, and general effectiveness.

14.
This study investigates the effects of rater personality (Conscientiousness and Agreeableness), rating format (graphic rating scale vs. behavioral checklist), and the rating social context (face‐to‐face feedback vs. no face‐to‐face feedback) on rating elevation of performance ratings. As predicted, raters high on Agreeableness showed more elevated ratings than those low on Agreeableness when they expected to have the face‐to‐face feedback meeting. Furthermore, rating format moderated the relationship between Agreeableness and rating elevation, such that raters high on Agreeableness provided less elevated ratings when using the behavioral checklist than the graphic rating scale, whereas raters low on Agreeableness showed little difference in elevation across different rating formats. Results also suggest that the interactive effects of rater personality, rating format, and social context may depend on the performance level of the ratee. The implications of these findings will be discussed.

15.
A number of rating systems are available to evaluate emotional communication in a single modality. The main purpose of this study was to develop procedures to train human raters to evaluate posed expressions of emotion across three different channels of communication, i.e., facial, prosodic/intonational, and lexical/verbal. These procedures were used to evaluate posed emotional expressions produced by individuals with unilateral brain lesions from stroke. Posers in this preliminary report were two right brain-damaged, two left brain-damaged, and two normal control right-handed adults who were matched on demographic and neurological factors. Eight emotional expressions, both positive and negative, were produced in three channels and rated for intensity, pleasantness, and category accuracy. Fifteen normal adults served as raters, five per channel. The rating procedures were comparable across channels, with analogous properties, and yielded substantial interrater agreement. In this small sample of posers, it was observed that the expressions of the right brain-damaged group were rated as the least accurate and those of the left brain-damaged group as the most intense. When patterns of individual performance across the channels were examined, performance was quite consistent for the normal controls yet variable for the right brain-damaged persons. These observations are in keeping with the notion that patients with right hemisphere pathology have difficulty in emotional communication. In summary, these findings suggest that comparison of emotional expressions across multiple channels is feasible.

16.
The consistency and loci of leniency, halo, and range restriction effects in performance ratings were investigated in a longitudinal study. Ratings were provided by approximately 90 supervisors in a metropolitan police department, who rated approximately 350 police-rank subordinates on five occasions over a three and one-half year period. Rating effects were computed separately as rater- and ratee-based statistics, and intercorrelated among the five rating periods. The nature of the data set made it possible to hold either raters or ratees constant for each analysis, thus permitting inferences regarding the sources of reliable variance in effects as due to raters or ratees. It was concluded that reliable variance in mean ratings is partly attributable to ratees, but mainly introduced by raters. Reliable halo variance is attributable to raters, and range restriction is a product of stable group performance variability within intact ratee groups. Implications of these results for future rating process research are discussed.

17.
This paper demonstrates and compares methods for estimating the interrater reliability and interrater agreement of performance ratings. These methods can be used by applied researchers to investigate the quality of ratings gathered, for example, as criteria for a validity study, or as performance measures for selection or promotional purposes. While estimates of interrater reliability are frequently used for these purposes, indices of interrater agreement appear to be rarely reported for performance ratings. A recommended index of interrater agreement, the T index (Tinsley & Weiss, 1975), is compared to four methods of estimating interrater reliability (Pearson r, coefficient alpha, mean correlation between raters, and intraclass correlation). Subordinate and superior ratings of the performance of 100 managers were used in these analyses. The results indicated that, in general, interrater agreement and reliability among subordinates were fairly high. Interrater agreement between subordinates and superiors was moderately high; however, interrater reliability between these two rating sources was very low. The results demonstrate that interrater agreement and reliability are distinct indices and that both should be reported. Reasons are discussed as to why interrater reliability should not be reported alone. This paper is based, in part, on a thesis submitted to East Carolina University by the second author. Portions of this study were presented at the American Psychological Association meeting in New Orleans, LA, August, 1989. The authors would like to thank Michael Campion and two anonymous reviewers for their comments on earlier drafts of this paper.

18.
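Compared with other assessment center techniques, the influence of assessor factors on rating results is especially important in the leaderless group discussion. This study examined how novice assessors' working memory and personality affect the validity of their ratings in leaderless group discussions. The results showed, first, that novice assessors had low inter-rater agreement and poor rating accuracy. Second, some components of working memory and personality affected novice assessors' rating validity in different ways, specifically: (1) the stronger the altruism, the more accurate the overall mean of a novice assessor's ratings and the more lenient the ratings; (2) the more decisive the novice assessor, the more accurately they differentiated among all candidates; (3) the more composed the novice assessor, the more effectively they differentiated among dimensions; (4) attention shifting and inhibition ability had an inhibitory effect on novice assessors' halo effect and on the accuracy with which they differentiated across dimensions.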

19.
This article describes the 1997 revision of the Dutch Rating System for Test Quality used by the Committee of Test Affairs of the Dutch Association of Psychologists (COTAN). The revised rating system evaluates the quality of a test on 7 criteria: Theoretical basis and the soundness of the test development procedure, Quality of the testing materials, Comprehensiveness of the manual, Norms, Reliability, Construct validity, and Criterion validity. For each criterion, a checklist with a number of items is provided. Some items (for each criterion at least 1) are so-called key questions, which check whether certain minimum conditions are met. If a key question is rated negative, the rating for that criterion will automatically be "insufficient." To enhance a uniform interpretation of the items by the raters and to explain the system to test users and test developers, comment sections provide detailed information on rating and weighting the items. Once the items have been rated, the final grades (insufficient, sufficient, or good) for the 7 criteria are established by means of weighting rules.

20.
During three sessions, each of 24 Ss responded to noxious thermal stimuli, using the following judgments: binary decision, S responded “high” or “low”; sensory intensity rating, S rated his sensory experience along a thermal intensity continuum; and concurrent report, S’s binary decision was followed by an intensity rating. The binary-decision d′ was significantly higher than the rating d′, suggesting that Ss could not maintain multiple thermal criteria in a consistent fashion. The criteria for pain obtained with single and concurrent intensity rating judgments did not differ. These results suggest that the most efficacious and valid method for the study of experimental pain is to obtain concurrent responses, and to use binary decisions to compute d′ and sensory intensity ratings to locate S’s criterion for reporting pain.
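The abstract above relies on the signal detection index d′ without defining it. As a minimal sketch under the standard equal-variance Gaussian model, d′ is the difference between the z-transformed hit rate and false-alarm rate, and the response criterion c is minus half their sum. The function names and the example rates below are hypothetical, and this is not presented as the authors' own procedure (which locates the pain criterion from intensity ratings).

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index: z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)


def criterion(hit_rate, false_alarm_rate):
    """Response criterion c: positive values indicate a bias toward 'low'."""
    z = NormalDist().inv_cdf
    return -0.5 * (z(hit_rate) + z(false_alarm_rate))


# Hypothetical rates: "high" responses to the stronger stimulus (hits)
# and to the weaker stimulus (false alarms).
print(d_prime(0.80, 0.30), criterion(0.80, 0.30))
```

Rates of exactly 0 or 1 have no finite z-score, so `inv_cdf` raises an error for them; in practice such rates are usually adjusted slightly before computing d′.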
