Similar Documents
1.
Inter-rater reliability and accuracy are measures of rater performance. Inter-rater reliability is frequently used as a substitute for accuracy despite conceptual differences and literature suggesting important differences between them. The aims of this study were to compare inter-rater reliability and accuracy among a group of raters using a treatment adherence scale, and to assess factors affecting the reliability of these ratings. Paired undergraduate raters assessed therapist behavior by viewing videotapes of 4 therapists' cognitive behavioral therapy sessions. Ratings were compared with expert-generated criterion ratings and between raters using intraclass correlation (2,1). Inter-rater reliability was marginally higher than accuracy (p = 0.09). The specific therapist significantly affected inter-rater reliability and accuracy. The frequency and intensity of the therapists' ratable behaviors in the criterion ratings correlated only with rater accuracy. Consensus ratings were more accurate than individual ratings, but composite ratings were not more accurate than consensus ratings. In conclusion, accuracy cannot be assumed to exceed inter-rater reliability or vice versa, and both are influenced by multiple factors. In this study, the subject of the ratings (i.e., the therapist and the intensity and frequency of rated behaviors) was shown to influence inter-rater reliability and accuracy. The additional resources needed for a composite rating, a rating based on the average score of paired raters, may be justified by improved accuracy over individual ratings. The additional time required to arrive at a consensus rating, a rating generated following discussion between 2 raters, may not be warranted. Further research is needed to determine whether these findings hold true with other raters and treatment adherence scales.
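The abstract does not show the computation; the following is a minimal sketch of the ICC(2,1) statistic it cites (the Shrout and Fleiss two-way random-effects, absolute-agreement, single-rater form), assuming a complete targets-by-raters score matrix with hypothetical data:

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """Shrout & Fleiss ICC(2,1): two-way random effects, absolute
    agreement, single rater. ratings is an (n_targets x k_raters)
    matrix with no missing cells."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-target means
    col_means = ratings.mean(axis=0)   # per-rater means
    ms_r = k * np.sum((row_means - grand) ** 2) / (n - 1)   # targets
    ms_c = n * np.sum((col_means - grand) ** 2) / (k - 1)   # raters
    ss_e = np.sum((ratings - row_means[:, None]
                   - col_means[None, :] + grand) ** 2)
    ms_e = ss_e / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Hypothetical example: 6 taped sessions scored by a pair of raters
scores = np.array([[3, 4], [5, 5], [2, 3], [4, 4], [1, 2], [5, 4]])
print(round(icc_2_1(scores), 3))
```

Because ICC(2,1) treats raters as random, the same function can score a rater-versus-rater pairing (reliability) or a rater-versus-criterion pairing (accuracy), which matches the abstract's use of a single index for both comparisons.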

2.
Multiracial Americans represent a rapidly growing population (Shih & Sanchez, 2009); however, very little is known about the types of challenges and resilience experienced by these individuals. To date, few psychological measures have been created specifically to investigate the experiences of multiracial people. This article describes 2 studies focused on the development and psychometric properties of the Multiracial Challenges and Resilience Scale (MCRS). The MCRS was developed using a nationwide Internet sample of urban, multiracial adults. Exploratory factor analyses revealed 4 Challenge factors (Others' Surprise and Disbelief Regarding Racial Heritage, Lack of Family Acceptance, Multiracial Discrimination, and Challenges With Racial Identity) and 2 Resilience factors (Appreciation of Human Differences and Multiracial Pride). A confirmatory factor analysis with data from a second sample provided support for the stability of this factor structure. The reliability and validity of the measure, implications of these findings, and suggestions for future research are discussed.

3.
The standardization of ADHD ratings in adults is important given their differing symptom presentation. The authors investigated the agreement and reliability of rater standardization in a large-scale trial of atomoxetine in adults with ADHD. Training of 91 raters for the investigator-administered ADHD Rating Scale (ADHDRS-IV-Inv) occurred prior to initiation of a large, 31-site atomoxetine trial. Agreement between raters on total scores was established in two ways: (a) by the kappa coefficient (rater agreement for each item, with the percentage of raters that had identical item-by-item scores) and (b) by intraclass correlation coefficients (reliability). For the ADHDRS-IV-Inv, rater agreement was moderate, and reliability, as measured by Cronbach's alpha, was substantial. The data indicate that clinicians can be trained to reliably evaluate ADHD in adults using the ADHDRS-IV-Inv.
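For illustration only (the trial's own computation is not reproduced here), a sketch of the two agreement indices named above, using scikit-learn's cohen_kappa_score on hypothetical item-level scores from two raters:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater_a = [2, 1, 3, 0, 2, 2, 1, 3]   # hypothetical ADHDRS-IV item scores
rater_b = [2, 1, 2, 0, 2, 3, 1, 3]

# (a) chance-corrected item-by-item agreement
kappa = cohen_kappa_score(rater_a, rater_b)

# percentage of identical item scores (the raw agreement kappa corrects)
exact = np.mean(np.array(rater_a) == np.array(rater_b))

print(f"kappa = {kappa:.2f}, exact agreement = {exact:.0%}")
```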

4.
A program is described for computing interrater reliability by averaging, for each rater, the correlations between one rater's ratings and every other rater's ratings. For situations in which raters rate more than one ratee, raters' reliabilities can be computed for either each item or each ratee. The program reads data from a text file and writes the reliability coefficients to a text file. The standard Macintosh interface is implemented. The QuickBASIC program is distributed both as a listing and in compiled form; it benefits from a math coprocessor.
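The QuickBASIC source is not reproduced here, but the averaging scheme it describes is easy to sketch in Python; the data below are hypothetical, with one item scored by every rater for every ratee:

```python
import numpy as np

def per_rater_mean_correlation(ratings: np.ndarray) -> np.ndarray:
    """ratings: (n_raters x n_ratees) matrix of scores on a single item.
    Returns, for each rater, the mean correlation of that rater's
    ratings with every other rater's ratings."""
    r = np.corrcoef(ratings)        # rater-by-rater correlation matrix
    np.fill_diagonal(r, np.nan)     # drop each rater's self-correlation
    return np.nanmean(r, axis=1)    # average over the remaining raters

# Hypothetical data: 3 raters scoring 5 ratees on one item
ratings = np.array([[4, 2, 5, 3, 1],
                    [5, 2, 4, 3, 2],
                    [3, 1, 5, 4, 1]])
print(per_rater_mean_correlation(ratings).round(2))
```

The per-ratee variant mentioned in the abstract is the same computation applied, for each ratee, to that ratee's (n_raters x n_items) slice of the data.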

5.
Interrater correlations are widely interpreted as estimates of the reliability of supervisory performance ratings, and are frequently used to correct the correlations between ratings and other measures (e.g., test scores) for attenuation. These interrater correlations do provide some useful information, but they are not reliability coefficients. There is clear evidence of systematic rater effects in performance appraisal, and variance associated with raters is not a source of random measurement error. We use generalizability theory to show why rater variance is not properly interpreted as measurement error, and show how such systematic rater effects can influence both reliability estimates and validity coefficients. We show conditions under which interrater correlations can either overestimate or underestimate reliability coefficients, and discuss reasons other than random measurement error for low interrater correlations.
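As a sketch of the generalizability-theory argument, the following recovers the person, rater, and residual variance components from the expected mean squares of a fully crossed persons-by-raters design with one observation per cell (the design and data layout are assumptions, not the authors' dataset):

```python
import numpy as np

def variance_components(x: np.ndarray):
    """x: (n_persons x k_raters) score matrix, fully crossed,
    one score per cell."""
    n, k = x.shape
    grand = x.mean()
    p_means, r_means = x.mean(axis=1), x.mean(axis=0)
    ms_p = k * np.sum((p_means - grand) ** 2) / (n - 1)   # persons
    ms_r = n * np.sum((r_means - grand) ** 2) / (k - 1)   # raters
    ms_e = np.sum((x - p_means[:, None] - r_means[None, :] + grand) ** 2) \
           / ((n - 1) * (k - 1))
    var_e = ms_e                          # sigma^2(pr,e): residual error
    var_r = max((ms_r - ms_e) / n, 0.0)   # sigma^2(r): systematic rater severity
    var_p = max((ms_p - ms_e) / k, 0.0)   # sigma^2(p): true person variance
    return var_p, var_r, var_e
```

The abstract's point can be read off the components: an interrater correlation ignores the systematic var_r term, so it need not match a reliability coefficient defined for either relative or absolute decisions.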

6.
赵群, 曹亦薇. 应用心理学 (Chinese Journal of Applied Psychology), 2006, 12(3): 258-263
Portfolio assessment is popular because it promotes student development and instructional improvement, but weak reliability and validity have limited its use in educational evaluation. This paper reports an empirical study of the rater reliability of portfolio scoring: 4 raters graded 152 portfolios on two occasions, and rater reliability was computed with several statistical methods. The results show that the portfolio ratings had high relatedness, moderate-to-weak agreement, and some stability, and that ratings of overall portfolio quality were the most reliable. In this study, with 3 raters, both the generalizability coefficient and the dependability coefficient for ratings of overall portfolio quality exceeded 0.80.
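The final claim, that three raters push both coefficients above 0.80 when overall portfolio quality is rated, corresponds to a D-study projection over the number of raters. A sketch with hypothetical variance components (not the study's estimates):

```python
# D-study projection for an average over n_r raters, given G-study
# variance components from a persons-by-raters design.
def d_study(var_p: float, var_r: float, var_e: float, n_r: int):
    e_rho2 = var_p / (var_p + var_e / n_r)           # generalizability (relative)
    phi = var_p / (var_p + (var_r + var_e) / n_r)    # dependability (absolute)
    return e_rho2, phi

var_p, var_r, var_e = 0.50, 0.05, 0.30   # hypothetical components
for n_r in (1, 2, 3, 4):
    e_rho2, phi = d_study(var_p, var_r, var_e, n_r)
    print(f"n_r={n_r}: E rho^2={e_rho2:.2f}, Phi={phi:.2f}")
```

With these illustrative components, both coefficients cross 0.80 at n_r = 3, mirroring the pattern the abstract reports.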

7.
A total of 4 raters, including 2 teachers and 2 research assistants, used Direct Behavior Rating Single Item Scales (DBR-SIS) to measure the academic engagement and disruptive behavior of 7 middle school students across multiple occasions. Generalizability study results for the full model revealed modest to large magnitudes of variance associated with persons (students), occasions of measurement (day), and associated interactions. However, an unexpectedly low proportion of the variance in DBR data was attributable to the facet of rater, as well as a negligible variance component for the facet of rating occasion nested within day (10-min interval within a class period). Results of a reduced model and subsequent decision studies specific to individual rater and rater type (research assistant and teacher) suggested that reliability-like estimates differed substantially depending on the rater. Overall, findings supported previous recommendations that, in the absence of estimates of rater reliability and firm recommendations regarding rater training, ratings obtained from DBR-SIS, and subsequent analyses, be conducted within rater. Additionally, results suggested that when selecting a teacher rater, the person most likely to substantially interact with target students during the specified observation period may be the best choice.

8.
This study examined the short-interval test-retest reliability of the Structured Clinical Interview (SCID-II; First, Spitzer, Gibbon, & Williams, 1995) for DSM-IV personality disorders (PDs). The SCID-II was administered to 69 inpatients and outpatients on two occasions separated by 1 to 6 weeks. The interviews were conducted at three sites by ten raters. Each rater acted as first and as second rater an equal number of times. The test-retest interrater reliability for the presence or absence of any PD was fair to good (kappa = .63) and was higher than values found in previous short-interval test-retest studies with the SCID-II for DSM-III-R. Test-retest reliability coefficients for trait and sum scores were sufficient, except for dependent PD. Values for single criteria were variable, ranging from poor to good agreement. Further large-scale test-retest research is needed to test the interrater reliability of more categorical diagnoses and single traits.

9.
An IRT Analysis of Rater Bias in Structured Interviews for the National Civil Service Examination
孙晓敏, 张厚粲. 心理学报 (Acta Psychologica Sinica), 2006, 38(4): 614-625
Using the many-facet Rasch model (MFRM) from item response theory (IRT), rater bias was analyzed for 12 raters in two panels conducting structured interviews for the national civil service examination. Two kinds of rater bias were proposed and verified: differences among raters in severity/leniency, and inconsistency within individual raters. The results showed significant differences in severity across raters, and raters also differed in how consistent their own rating behavior was across candidates, dimensions, genders, and occasions. The study shows that this analysis at the level of the individual rater overcomes the limitation of classical test theory (CTT), which treats raters only at the group level, and yields detailed diagnostic information about each rater's bias, providing a modern psychometric basis for targeted rater training and for building pools of qualified raters.

10.
Rater effects in creativity assessment refer to the influence that raters' participation has on assessment outcomes. Rater effects ultimately stem from differences in raters' internal cognitive processing and show up concretely as differences in their scores. This paper first reviews research on rater cognition and on how rater, creator, and sociocultural factors affect assessment. At the level of scoring outcomes, it then surveys indices of inter-rater reliability and their limitations, and the use of generalizability theory and the many-facet Rasch model to quantify and control rater effects. Finally, in view of open problems in current research, it identifies possible future directions, including deepening research on rater cognition, integrating research on rater effects across levels, and extending the methods and techniques of creativity assessment.

11.
The factor congruence and an analysis of potential bias of the Teacher Ratings of Social Skills (TROSS) were the focus of this study. Preliminary research on the TROSS has shown it to possess adequate reliability and validity. A sample of 250 mainstreamed school-age children from four different groups (behavior-disordered, learning-disabled, mildly mentally retarded/educationally handicapped, and nonhandicapped) was used to examine (a) rater, ratee, and sex biases in TROSS ratings by teachers, (b) concurrent validity and reliability, and (c) factor congruence with a previous investigation of nonhandicapped children. The results indicate that the TROSS discriminated between mainstreamed handicapped and nonhandicapped students at a reasonably high level. No rater, ratee, or sex biases were found. Coefficient alphas indicate that the TROSS is a highly reliable instrument. The factor structures of the present and previous research were essentially equivalent. In view of these results, the TROSS appears to be an instrument that can confidently be used as a screening instrument in a social skills assessment package.
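For reference, the coefficient alpha cited above is a short computation; a sketch with hypothetical item scores:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: (n_respondents x k_items) matrix of scale ratings."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical ratings: 5 teachers' scores on 4 TROSS-style items
data = np.array([[3, 4, 3, 4],
                 [2, 2, 3, 2],
                 [5, 4, 4, 5],
                 [1, 2, 1, 2],
                 [4, 4, 5, 4]])
print(round(cronbach_alpha(data), 2))
```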

12.
Based on the structural model of scientific creativity, the ways adolescents' scientific creativity manifests itself, and the Torrance tests of creativity, a Test of Adolescent Scientific Creativity was developed and administered to 1087 secondary school students in England and 1087 in China. The results showed that (1) the Test of Adolescent Scientific Creativity has high reliability, with Cronbach's alpha, inter-rater reliability, and test-retest reliability all reaching the levels required of psychological tests; and (2) the test has good construct validity.

13.
We examined ratings linking Work Behaviors to knowledge, skills, or abilities for 9 jobs to determine the degree to which differences in the ratings were due to rater type. We collected ratings from incumbents and 2 types of job analysts: project job analysts (analysts knowledgeable about the job) and nonproject job analysts (analysts with very little or no knowledge of the job). In our analyses of the data, we calculated means, standard deviations, effect sizes, and correlations for each rater type, and compared the reliability of the ratings. We also estimated variance components for each job by conducting generalizability analyses (Brennan, 1983; Shavelson, Webb, & Rowley, 1989). Our findings indicate that the level of linkage ratings is similar across rater types, that it is important to obtain ratings from multiple raters regardless of rater type, and that ratings from job analysts may be more reliable than those of incumbents.

14.
The many-facet Rasch model (MFRM) was used to analyze rater bias among 28 raters evaluating the educational quality of early childhood programs. The analysis showed significant differences in severity among the 28 raters; 3 raters showed poor internal consistency while the remaining 25 were stable; the rater-by-class interaction was not significant, whereas the rater-by-item interaction was significant. The results indicate that the MFRM permits a concrete, individual-level analysis of rater bias in evaluations of early childhood program quality and, from the perspective of item response theory, provides a modern educational and psychometric basis for targeted rater training and for certifying raters in order to build a pool of qualified raters.

15.
Application of the Many-Facet Rasch Model in Structured Interviews
孙晓敏, 薛刚. 心理学报 (Acta Psychologica Sinica), 2008, 40(9): 1030-1040
Using the many-facet Rasch model from item response theory, the structured-interview scores of 66 candidates were analyzed. Removing the error that measurement-context factors such as the raters introduce into the raw scores yielded ability estimates for the candidates, together with individual-level information on rater consistency. Comparing selection decisions based on the ability estimates with those based on raw interview scores showed that measurement error did affect decisions, and for some candidates the effect was substantial. Facets bias analyses and Facets analyses of rater severity were then used to trace the sources of error. The results showed that when the raw interview scores of candidates from different panels are compared directly, both raters' internal inconsistency and differences among raters in severity introduce error. The study indicates that basing decisions on the Facets ability estimates will improve the validity of selection, and that the individual-level rater consistency indices and the rater-candidate bias analyses produced by Facets provide detailed diagnostic information for locating the sources of interview error.

16.
The separate questions on an essay test or the individual judges on a rater panel may constitute congeneric parts rather than tau-equivalent parts. Also, it may be necessary to infer the lengths of the congeneric parts from their variances and covariances, rather than from some obvious feature of each part, such as the range of possible scores. Cronbach's alpha coefficient applied to such part-test data will underestimate total score reliability. Several reliability coefficients are developed for such instruments. They may be regarded as extensions of the coefficient developed by Kristof for a three-part test.
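A sketch of the three-part congeneric case attributed to Kristof: for parts with true-score "lengths" lambda_i, the pairwise covariances satisfy sigma_ij = lambda_i * lambda_j, which identifies the lengths and hence the total-score reliability. The data here are simulated; the paper develops several such coefficients, and this shows only the simplest case:

```python
import numpy as np

def kristof_reliability(parts: np.ndarray) -> float:
    """parts: (n_examinees x 3) matrix of congeneric part scores."""
    s = np.cov(parts, rowvar=False)
    # Solve sigma_ij = lambda_i * lambda_j (i != j) for the lengths
    lam1 = np.sqrt(s[0, 1] * s[0, 2] / s[1, 2])
    lam2 = np.sqrt(s[0, 1] * s[1, 2] / s[0, 2])
    lam3 = np.sqrt(s[0, 2] * s[1, 2] / s[0, 1])
    true_var = (lam1 + lam2 + lam3) ** 2        # true-score variance of total
    total_var = parts.sum(axis=1).var(ddof=1)   # observed total-score variance
    return true_var / total_var

# Simulated congeneric parts: unequal loadings on one common true score
rng = np.random.default_rng(1)
t = rng.normal(0, 1, 300)
parts = np.column_stack([1.0 * t + rng.normal(0, 0.8, 300),
                         0.7 * t + rng.normal(0, 0.6, 300),
                         1.3 * t + rng.normal(0, 0.9, 300)])
print(round(kristof_reliability(parts), 2))
```

Because the loadings are unequal, coefficient alpha computed on the same three parts would come out lower, which is exactly the underestimation the abstract describes.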

17.
This study compared the clinical rating scales from the Circumplex Model (CCRS), the McMaster Model of Family Functioning (MCRS) and the Family Health Scales (FHS). A central purpose was to investigate whether observational rating scales designed to measure family functioning can also be used to assess couple functioning. One hundred and sixty-six drug-abusing women receiving primary drug treatment at two metropolitan drug agencies in the southwestern US, and their partners, were videotaped while engaged in three couple tasks prior to receiving any treatment. The main findings indicated that all three rating scales have: (1) sufficient interrater reliability; (2) good construct validity, as reflected in factor analyses; and (3) very good convergent validity. However, there was some concern about whether these scales are as discriminating when measuring couples as they are when assessing families.

18.
The aim of this study was to evaluate inter-rater reliability when using the Swedish version of the Motivational Interviewing Treatment Integrity (MITI) code as an adjunct to MI training, clinical practice and research. Coders were trained to use the MITI for scoring taped sessions. The 4-month basic training comprised 39 hours. Following training, 60 audio-taped live interviews were randomly assigned for MITI coding. Mean intra-class correlation (ICC) coefficients were calculated for 7 coders across all pairs of coders. Cronbach's alpha was calculated to estimate the covariance between each pair across their common interviews. Six months later, a second inter-rater reliability test was performed, in which 5 coders coded the same 15 randomly selected tapes. At the second reliability testing, the mean ICC was 0.81 and the mean Cronbach's alpha was 0.96. However, the ICC varied across sub-variables of the MITI, ranging from 0.42 for Empathy to 0.79 for the number of Closed Questions. In conclusion, the MITI shows promising potential as a reliable tool to confirm and enhance MI training and practice in clinical settings and to evaluate MI integrity in clinical MI research. However, coder assessment of the "global" variables Empathy and MI Spirit requires further refinement.

19.
A procedure for evaluating a variety of rater reliability models is presented. A multivariate linear model is utilized to describe and assess a set of ratings. The parameters of such a model are reexpressed in terms of a factor-analytic model. Maximum likelihood methods are employed to estimate and test the parameters in this factor-analytic model. The approach is related to the use of the intraclass correlation coefficient to estimate reliability. Two examples are presented, and the results contrasted to those found with an intraclass correlation approach. Extensions of the procedure to multiple sets of judges, multiple measures, and multiple groups are introduced.
