Similar Articles
19 similar articles found (search time: 187 ms)
1.
Application of the Many-Facet Rasch Model in Structured Interviews   Cited by 1 (0 self-citations, 1 by others)
孙晓敏  薛刚 《心理学报》2008,40(9):1030-1040
Using the many-facet Rasch model from item response theory, this study analyzed the structured-interview scores of 66 candidates, removing from the raw scores the error introduced by facets of the measurement context such as raters, and obtaining ability estimates for the candidates together with individual-level rater-consistency information. Comparing decisions based on the ability estimates with decisions based on raw interview scores showed that measurement error does affect decisions, and for some individual candidates the effect is substantial. Facets bias analysis and Facets analysis of rater severity were then used to trace the sources of error. The results indicate that directly comparing raw interview scores of candidates from different interview panels introduces error both from raters' own internal consistency and from differences in severity between raters. The study suggests that using the Facets ability estimates as the basis for decisions would improve the validity of selection, and that the individual-level rater-consistency indices and the rater-candidate bias analyses produced by Facets also provide detailed diagnostic information for locating the sources of interview error.

2.
Rater effects in creativity assessment refer to the influence that raters' involvement has on assessment results. Rater effects stem, at root, from differences in raters' internal cognitive processing, and show up concretely as differences in their scores. This paper first reviews research on rater cognition and on how rater, creator, and sociocultural factors influence assessment. It then surveys, at the level of scoring outcomes, the indices of inter-rater reliability and their limitations, along with applications of generalizability theory and the many-facet Rasch model to quantifying and controlling rater effects. Finally, in light of open problems in current research, it points to possible future directions, including deepening research on rater cognition, integrating research on rater effects across levels, and extending the methods and techniques of creativity assessment.

3.
Taking the student contestants in a university club's team-collaboration project as the assessment sample, this paper uses the many-facet Rasch model to examine the validity of teamwork-ability assessment across three facets: contestants, raters, and assessment content. The results show that most contestants' teamwork ability was at a medium level with little variation between them, while the four raters were lenient and inconsistent in severity and exhibited bias in their scoring. The paper also characterizes the structure of teamwork ability, offering a reference for cultivating university students' teamwork skills.

4.
Application of the Many-Facet Rasch Model in Subjective Scoring   Cited by 1 (1 self-citation, 0 by others)
Inconsistency in subjective scoring lowers its reliability. The many-facet Rasch model, which is grounded in item response theory, can be applied to identify and remove rater effects and thereby improve the reliability of subjective scoring. This paper introduces the theory and application framework of the many-facet Rasch model, surveys representative applications from abroad, and discusses the conditions under which the model applies.

5.
A Rasch-Based Experimental Analysis of Subjective Scoring in the HSK   Cited by 1 (0 self-citations, 1 by others)
Inconsistency in subjective scoring lowers its reliability. The many-facet Rasch model, which is grounded in item response theory, can be applied to identify and remove rater effects and thereby improve the reliability of subjective scoring. This paper introduces the theory and application framework of the many-facet Rasch model, designs a quality-control framework for HSK subjective scoring based on the model, and verifies it experimentally with HSK essay-scoring data.

6.
孙晓敏  张厚粲 《心理科学》2005,28(3):646-649
With the advance of quality-oriented education, performance assessment has been receiving more and more attention. An important factor affecting the reliability of performance-assessment results is inconsistency between raters. Using simulated data, this article compares methods of estimating rater consistency, including the correlation method, the percent-agreement method, and generalizability coefficients, and argues that applying generalizability theory to the problem of rater reliability in performance assessment is a promising direction for both theory and practice.

7.
Multivariate generalizability theory was applied to evaluate and compare the internal-medicine clinical-skills examinations of two cohorts of clinical-medicine master's students. The results show that the examination was reasonably reliable for both cohorts, with dependability indices of 0.78878 and 0.67985 respectively, and that the examination content was fairly comprehensive. The comparison found that the reliability of the 2001 cohort's examination was higher than that of the 2002 cohort, and that a panel of three to five examiners is appropriate.

8.
This study applied generalizability theory (GT) and the many-facet Rasch model to structured-interview data and offers some recommendations. For data from a counselor-recruitment interview, GT was first used at the macro level to analyze the overall error contributed by applicants, interviewers, and items; on that basis, the many-facet Rasch model was used at the micro level to probe interviewer severity, differences in applicant ability, item difficulty, and facet bias. The results show: (1) the GT analysis found that applicants contributed most of the variance (90.65%), indicating that the interview was fairly reliable and that reliability was already good with two interviewers; (2) the many-facet Rasch analysis identified misfitting elements within each facet and bias factors in the interactions, indicating that interview error came mainly from differences in severity between interviewers and from instability in their internal consistency. Combining GT with the many-facet Rasch model to analyze interview data not only detects problem factors in every aspect of the evaluation process but also supports a better overall picture.

9.
Using the many-facet Rasch model, this study examined, at the levels of the rating scale and the raters, the scoring of the 17 raters who marked the essays in the 2007 Grade 8 Chinese-language academic proficiency test. The results show: (1) the ability values corresponding to raters' score categories followed the expected pattern, and most raters had good internal consistency; (2) raters differed significantly in severity, while agreement between raters was good overall; (3) in addition, the scoring of several raters with poor internal consistency was examined further.

10.
Applying Generalizability Theory to Performance Appraisal   Cited by 1 (0 self-citations, 1 by others)
秦磊  袁登华 《心理科学》2005,28(3):650-651
Drawing on its theoretical advantages, generalizability theory largely overcomes the shortcomings of performance appraisal based on classical test theory. It allows a more complete estimate of the reliability of performance appraisal and better prediction and control of error, and its distinctive perspective and methods also provide theoretical support for establishing the validity of 360-degree performance appraisal.

11.
朱宇  冯瑞龙  辛涛 《心理科学》2013,36(2):479-483
From the perspective of generalizability theory, this study collected response and scoring data for simulated writing items of the new HSK Level 5, estimated the variance components of potential effects including item type, item count, number of raters, and rating speed, examined the dependability of new HSK writing scores, and explored ways to improve it. Data analysis based on generalizability theory and optimization yielded an adjustment plan for item count as well as an optimal combination of item type, item count, and number of raters. The analysis of rating speed is an exploratory theoretical contribution, while the other results may inform decisions aimed at improving the quality of the test.

12.
Applying Generalizability Theory to Essay Scoring   Cited by 30 (3 self-citations, 27 by others)
刘远我  张厚粲 《心理学报》1998,31(2):211-218
Generalizability theory is one of the modern psychometric theories. This article briefly introduces its basic ideas and applies it to the problem of controlling error in essay scoring. Six raters used analytic scoring to rate essays in three genres written by each of 20 students. The rater and item effects in essay scoring were then estimated with the GENOVA software, and the various error components were analyzed and compared. The results show that in essay scoring the rater effect was the largest while the item effect was not notable. It was also found that genre has an important influence on scoring error, with argumentative essays showing the largest error.
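As background for the crossed persons-by-raters (p × r) design these G-theory analyses use, variance components can be estimated from the two-way ANOVA mean squares via their expected values. The sketch below is a minimal illustration only, not the GENOVA implementation; the function names and data are invented for this example:

```python
import numpy as np

def p_x_r_components(scores):
    """Estimate G-theory variance components for a crossed p x r design.

    scores: (n_persons, n_raters) array of ratings.
    Returns (var_p, var_r, var_pr_e), truncating negative estimates to 0.
    """
    Y = np.asarray(scores, dtype=float)
    n_p, n_r = Y.shape
    grand = Y.mean()
    # Sums of squares for persons, raters, and the residual (pr,e confound)
    ss_p = n_r * ((Y.mean(axis=1) - grand) ** 2).sum()
    ss_r = n_p * ((Y.mean(axis=0) - grand) ** 2).sum()
    ss_pr = ((Y - grand) ** 2).sum() - ss_p - ss_r
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    # Expected-mean-squares solutions
    var_pr_e = ms_pr
    var_p = max((ms_p - ms_pr) / n_r, 0.0)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)
    return var_p, var_r, var_pr_e

def g_coefficient(var_p, var_pr_e, n_r):
    """Relative generalizability coefficient for a D study with n_r raters."""
    return var_p / (var_p + var_pr_e / n_r)
```

In this design the person variance is the "true-score" variance of interest; for relative decisions only the person-by-rater residual enters the error term, which is why adding raters (larger `n_r`) raises the coefficient.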

13.
This study investigated how personal cognitive style and training affect rating validity on two different rating tasks. Male undergraduate volunteers (n = 53) served as raters and rated videotaped lecturers. Using the Embedded Figures Test to measure cognitive style, two groups of raters were formed: those who tend to structure the information presented (articulated) and those who do not (global). Half of the raters of each cognitive style received observational training designed to be congruent with the behavioral rating task. All raters completed two rating tasks: one requiring an evaluative judgment and one requiring a judgment of behavior frequency. It was hypothesized that on the evaluative rating task cognitive style, but not training, would be a significant predictor of validity, because the training was not relevant to that task. It was also hypothesized that on the observational task training would improve rating validity (overcoming cognitive style), because the training was relevant to that rating task. Both hypotheses were supported. I wish to thank Dr. Kevin Murphy for the use of the videotapes.

14.
Traditionally, researchers employ human raters for scoring responses to creative thinking tasks. Apart from the associated costs, this approach entails two potential risks. First, human raters can be subjective in their scoring behavior (inter-rater variance). Second, individual raters are prone to inconsistent scoring patterns (intra-rater variance). In light of these issues, we present an approach for automated scoring of Divergent Thinking (DT) tasks. We implemented a pipeline aiming to generate accurate rating predictions for DT responses using text mining and machine learning methods. Based on two existing data sets from two different laboratories, we constructed several prediction models incorporating features representing meta information of the response or features engineered from the response's word embeddings, which were obtained using pre-trained GloVe and Word2Vec word vector spaces. Of these features, word embeddings and features derived from them proved particularly effective. Overall, longer responses tended to achieve higher ratings, as did responses that were semantically distant from the stimulus object. In our comparison of three state-of-the-art machine learning algorithms, Random Forest and XGBoost tended to slightly outperform Support Vector Regression.
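The semantic-distance feature this abstract describes (distance between a response and the stimulus object in an embedding space) can be sketched roughly as follows. This is a hedged illustration, not the authors' pipeline: the three-dimensional toy vectors stand in for real GloVe/Word2Vec embeddings, and all names and values are invented.

```python
import numpy as np

# Toy embedding table standing in for a pre-trained GloVe/Word2Vec space
# (hypothetical 3-d vectors; real spaces have hundreds of dimensions).
EMB = {
    "brick": np.array([1.0, 0.0, 0.0]),
    "build": np.array([0.9, 0.1, 0.0]),
    "doorstop": np.array([0.2, 0.8, 0.1]),
    "paperweight": np.array([0.1, 0.9, 0.2]),
}

def mean_vector(words):
    """Average the embeddings of the in-vocabulary words of a response."""
    vectors = [EMB[w] for w in words if w in EMB]
    return np.mean(vectors, axis=0)

def semantic_distance(stimulus, response_words):
    """1 - cosine similarity between the stimulus and the response centroid."""
    a = EMB[stimulus]
    b = mean_vector(response_words)
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cosine
```

Under this sketch, a conventional use ("build" for the stimulus "brick") sits close to the stimulus, while a more original use ("doorstop", "paperweight") is semantically distant, matching the finding that distant responses tend to receive higher creativity ratings.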

15.
Inter-rater reliability and accuracy are measures of rater performance. Inter-rater reliability is frequently used as a substitute for accuracy despite conceptual differences and literature suggesting important differences between them. The aims of this study were to compare inter-rater reliability and accuracy among a group of raters, using a treatment adherence scale, and to assess for factors affecting the reliability of these ratings. Paired undergraduate raters assessed therapist behavior by viewing videotapes of 4 therapists' cognitive behavioral therapy sessions. Ratings were compared with expert-generated criterion ratings and between raters using intraclass correlation (2,1). Inter-rater reliability was marginally higher than accuracy (p = 0.09). The specific therapist significantly affected inter-rater reliability and accuracy. The frequency and intensity of the therapists' ratable behaviors of criterion ratings correlated only with rater accuracy. Consensus ratings were more accurate than individual ratings, but composite ratings were not more accurate than consensus ratings. In conclusion, accuracy cannot be assumed to exceed inter-rater reliability or vice versa, and both are influenced by multiple factors. In this study, the subject of the ratings (i.e. the therapist and the intensity and frequency of rated behaviors) was shown to influence inter-rater reliability and accuracy. The additional resources needed for a composite rating, a rating based on the average score of paired raters, may be justified by improved accuracy over individual ratings. The additional time required to arrive at a consensus rating, a rating generated following discussion between 2 raters, may not be warranted. Further research is needed to determine whether these findings hold true with other raters and treatment adherence scales.
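For reference, the ICC(2,1) used in this study (two-way random effects, absolute agreement, single rater) can be computed from the two-way ANOVA mean squares. A minimal sketch, with invented data and function name:

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_subjects, k_raters) array, fully crossed, no missing cells.
    """
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), error
    ss_total = ((Y - grand) ** 2).sum()
    ss_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)            # subjects mean square
    msc = ss_cols / (k - 1)            # raters mean square
    mse = ss_err / ((n - 1) * (k - 1)) # error mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because ICC(2,1) measures absolute agreement, a rater with a constant bias lowers the coefficient even when the rank ordering of subjects is perfectly preserved, which is part of why reliability and accuracy can diverge.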

16.
The Many-Facet Rasch Model and Its Application in Structured Interviews   Cited by 1 (0 self-citations, 1 by others)
Addressing the various sources of error that affect interview validity, this article introduces a novel method for processing interview results: the many-facet Rasch model. Applying this model in structured interviews not only helps to measure candidates' ability effectively, but also offers a new approach to identifying problem raters, refining scoring rubrics, and equating interviews. After reviewing research progress on the reliability and validity of structured interviews, the article presents the theory of the many-facet Rasch model and a framework for its application in structured interviews.
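For reference, the many-facet Rasch model these abstracts invoke is usually written, in Linacre's Facets formulation, as a rating-scale model extended with a rater facet:

$$\ln\frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k$$

where $P_{nijk}$ is the probability that candidate $n$ receives category $k$ from rater $j$ on item $i$, $B_n$ is the candidate's ability, $D_i$ the item's difficulty, $C_j$ the rater's severity, and $F_k$ the difficulty of category $k$ relative to $k-1$. Because rater severity $C_j$ is an explicit parameter on the same logit scale as ability, the model can "subtract out" severity differences and report ability estimates adjusted for which rater a candidate happened to face.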

17.
Inter-rater reliability and accuracy are measures of rater performance. Inter-rater reliability is frequently used as a substitute for accuracy despite conceptual differences and literature suggesting important differences between them. The aims of this study were to compare inter-rater reliability and accuracy among a group of raters, using a treatment adherence scale, and to assess for factors affecting the reliability of these ratings. Paired undergraduate raters assessed therapist behavior by viewing videotapes of 4 therapists' cognitive behavioral therapy sessions. Ratings were compared with expert-generated criterion ratings and between raters using intraclass correlation (2,1). Inter-rater reliability was marginally higher than accuracy (p = 0.09). The specific therapist significantly affected inter-rater reliability and accuracy. The frequency and intensity of the therapists' ratable behaviors of criterion ratings correlated only with rater accuracy. Consensus ratings were more accurate than individual ratings, but composite ratings were not more accurate than consensus ratings. In conclusion, accuracy cannot be assumed to exceed inter-rater reliability or vice versa, and both are influenced by multiple factors. In this study, the subject of the ratings (i.e. the therapist and the intensity and frequency of rated behaviors) was shown to influence inter-rater reliability and accuracy. The additional resources needed for a composite rating, a rating based on the average score of paired raters, may be justified by improved accuracy over individual ratings. The additional time required to arrive at a consensus rating, a rating generated following discussion between 2 raters, may not be warranted. Further research is needed to determine whether these findings hold true with other raters and treatment adherence scales.

18.
Compared with other assessment-center techniques, the influence of assessors on scoring results is especially important in the leaderless group discussion. This study examined how novice assessors' working memory and personality affect the validity of their scores in leaderless group discussions. The results show, first, that novice assessors had low inter-rater agreement and poor scoring accuracy. Second, working memory and certain personality factors affected novice assessors' scoring validity in different ways: (1) the stronger an assessor's altruism, the more accurate the overall mean of their scores and the more lenient their scoring; (2) the stronger their decisiveness, the more accurately they differentiated among all the applicants; (3) the higher their steadiness, the more effectively they differentiated the dimensions; (4) attention-switching and inhibitory ability had a restraining effect on novice assessors' halo effect and on the accuracy of their differentiation across dimensions.

19.
黎光明  蒋欢 《心理科学》2019,(3):731-738
A test that includes a rater facet usually does not conform to any standard generalizability-theory design, so from the GT perspective the data from such a test should be treated as missing data, and what determines the missing-data structure is the test's rating plan. Data under three rating plans were simulated with R, and the estimation performance of the traditional method, the evaluation method, and the splitting method was compared under each plan. The results show: (1) the traditional method estimates poorly; (2) when rater agreement is high, the evaluation method is suitable; (3) the splitting method gives the most accurate estimates, though under the fixed-rater plan attention must be paid to the ratio of raters to examinees: estimates are fairly accurate when this ratio is at most 0.0047.
