Similar literature
1.
Rater-mediated assessments, such as teacher behavior rating scales, measure student behavior indirectly through the lens of a rater. As a result, scores from rater-mediated assessments can be influenced by rater effects: individual differences in rater perspectives, attitudes, beliefs, and interpretation of rating scale items. Rater effects are a fundamental aspect of all rater-mediated assessments. However, traditional approaches to evaluating rater effects (i.e., classical test theory, generalizability theory, and multilevel modeling) merely estimate how much score variability is due to the rater. These approaches, while informative, do not offer a solution to the problem. In contrast, many-facet Rasch measurement (MFRM) approaches estimate and control for rater effects in rater-mediated assessments so that scores are adjusted to account for rater variability. Thus, MFRM offers unique insights into individual- and group-level rater effects that can be used to inform a solution. Accordingly, the purpose of this paper is to introduce MFRM, discuss its advantages for evaluating rater effects in rater-mediated assessments, and demonstrate its use through an applied example.

2.
Research studies in psychology and education often seek to detect changes or growth in an outcome over a duration of time. This research provides a solution to those interested in estimating latent traits from psychological measures that rely on human raters. Rater effects potentially degrade the quality of scores in constructed response and performance assessments. We develop an extension of the hierarchical rater model (HRM), which yields estimates of latent traits that have been corrected for individual rater bias and variability, for ratings that come from longitudinal designs. The parameterization, called the longitudinal HRM (L-HRM), includes an autoregressive time series process to permit serial dependence between latent traits at adjacent timepoints, as well as a parameter for overall growth. We evaluate and demonstrate the feasibility and performance of the L-HRM using simulation studies. Parameter recovery results reveal predictable amounts and patterns of bias and error for most parameters across conditions. An application to ratings from a study of character strength demonstrates the model. We discuss limitations and future research directions to improve the L-HRM.

3.
4.
On 4 of 7 days in each unit of an undergraduate human development course, students responded in writing to specific questions related to instructor notes previously made available to them. The study compared the effects of three writing contingencies on the quality of student writing and performance on major multiple-choice exams in the course. The three contingencies were (1) receiving credit for all writing products each unit, (2) receiving credit for one randomly selected writing product each unit, and (3) receiving no credit for any writing product each unit. On all dimensions of exam performance, writing for daily credit produced higher scores than did writing for random credit and writing for no credit. The daily-writing contingency also produced the highest writing ratings across all units; the writing for random credit produced the next highest writing scores; and the writing for no credit yielded the lowest writing scores. Across all three contingencies, writing scores were highly correlated with performance on multiple-choice exams.

5.
Recent research has questioned the importance of rater perspective effects on multisource performance ratings (MSPRs). Although this research makes a valuable contribution, we hypothesize that it has obscured evidence for systematic rater source effects as a result of misspecified models of the structure of multisource performance ratings and inappropriate analytic methods. Accordingly, this study provides a reexamination of the impact of rater source on multisource performance ratings by presenting a set of confirmatory factor analyses of two large samples of multisource performance rating data in which source effects are modeled in the form of second-order factors. Hierarchical confirmatory factor analysis of both samples revealed that the structure of multisource performance ratings can be characterized by general performance, dimensional performance, idiosyncratic rater, and source factors, and that source factors explain (much) more variance in multisource performance ratings whereas general performance explains (much) less variance than was previously believed. These results reinforce the value of collecting performance data from raters occupying different organizational levels and have important implications for research and practice.

6.
Advanced writing skills are an important aspect of academic performance as well as of subsequent work-related performance. However, American students rarely attain advanced scores on assessments of writing skills (National Assessment of Educational Progress, 2002). In order to achieve higher levels of writing performance, the working memory demands of writing processes should be reduced so that executive attention is free to coordinate interactions among them. This can in theory be achieved through deliberate practice that trains writers to develop executive control through repeated opportunities to write and through timely and relevant feedback. Automated essay scoring software may offer a way to alleviate the intensive grading demands placed on instructors and, thereby, substantially increase the amount of writing practice that students receive.

7.
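关丹丹 (Guan Dandan). 《心理学探新》 (Psychological Exploration), 2014, 34(5): 437-440
To evaluate and improve the scoring of the writing component of the general ability test in the postgraduate entrance examination, the researcher used generalizability theory and many-facet Rasch analysis to examine the sources of rating error and the rating reliability for the writing samples of 113 examinees. The generalizability study showed that raters and items had little influence on rating accuracy; for an examination design with two writing tasks, two raters are enough to keep rating reliability above 0.75. The many-facet Rasch analysis showed that the estimates of rater severity and their errors were within acceptable ranges, that raters did not differ significantly in severity, and that individual raters were on the whole consistent in their own ratings. However, a few raters showed specific bias toward particular examinees on particular tasks. Generalizability theory and many-facet Rasch analysis enriched the quantitative indicators available for research on writing assessment and confirmed that the writing scores on the general ability test of the postgraduate entrance examination have relatively high reliability.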

8.
Interrater correlations are widely interpreted as estimates of the reliability of supervisory performance ratings, and are frequently used to correct the correlations between ratings and other measures (e.g., test scores) for attenuation. These interrater correlations do provide some useful information, but they are not reliability coefficients. There is clear evidence of systematic rater effects in performance appraisal, and variance associated with raters is not a source of random measurement error. We use generalizability theory to show why rater variance is not properly interpreted as measurement error, and show how such systematic rater effects can influence both reliability estimates and validity coefficients. We show conditions under which interrater correlations can either overestimate or underestimate reliability coefficients, and discuss reasons other than random measurement error for low interrater correlations.

9.
When analysts evaluate performance assessments, they often use modern measurement theory models to identify raters who frequently give ratings that are different from what would be expected, given the quality of the performance. To detect problematic scoring patterns, two rater fit statistics, the infit and outfit mean square error (MSE) statistics, are routinely used. However, the interpretation of these statistics is not straightforward. A common practice is that researchers employ established rule-of-thumb critical values to interpret infit and outfit MSE statistics. Unfortunately, prior studies have shown that these rule-of-thumb values may not be appropriate in many empirical situations. Parametric bootstrapped critical values for infit and outfit MSE statistics provide a promising alternative approach to identifying item and person misfit in item response theory (IRT) analyses. However, researchers have not examined the performance of this approach for detecting rater misfit. In this study, we illustrate a bootstrap procedure that researchers can use to identify critical values for infit and outfit MSE statistics, and we used a simulation study to assess the false-positive and true-positive rates of these two statistics. We observed that the false-positive rates were highly inflated, and the true-positive rates were relatively low. Thus, we proposed an iterative parametric bootstrap procedure to overcome these limitations. The results indicated that using the iterative procedure to establish 95% critical values of infit and outfit MSE statistics had better-controlled false-positive rates and higher true-positive rates compared to using the traditional parametric bootstrap procedure and rule-of-thumb critical values.
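
As a rough illustration of the two fit statistics named in this abstract, the sketch below computes infit and outfit mean squares from their common textbook definitions: outfit as the unweighted mean of squared standardized residuals, infit as the information-weighted version. It is not the authors' code and does not implement their bootstrap procedure. The function name `fit_mean_squares` and the example numbers are invented for illustration, and the model-expected scores and variances are assumed to come from an already-fitted Rasch-type model.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean squares for one rater's ratings.

    observed: ratings the rater actually gave
    expected: model-expected ratings (assumed to come from a fitted model)
    variance: model variance of each rating (assumed, same source)
    """
    if not observed or not (len(observed) == len(expected) == len(variance)):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(v <= 0 for v in variance):
        raise ValueError("model variances must be positive")

    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]

    # Outfit: plain average of squared standardized residuals, so a single
    # very unexpected rating can dominate it.
    outfit = sum(r / v for r, v in zip(squared_residuals, variance)) / len(observed)

    # Infit: squared residuals weighted by model variance, which reduces to
    # the sum of squared residuals over the sum of variances.
    infit = sum(squared_residuals) / sum(variance)
    return infit, outfit


# Invented example: four dichotomous ratings with model probabilities.
probabilities = [0.8, 0.6, 0.3, 0.9]
ratings = [1, 1, 0, 0]
variances = [p * (1 - p) for p in probabilities]
infit, outfit = fit_mean_squares(ratings, probabilities, variances)
print(f"infit={infit:.3f}, outfit={outfit:.3f}")
```

Values near 1 indicate ratings about as variable as the model expects. Where the cut-offs for "too high" or "too low" should lie is exactly the question the abstract raises, so none are hard-coded here.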

10.
This study examines the effects of organizational differences and rater differences on performance appraisals. Self, peer, and supervisory ratings of performance for nurses in four hospitals and self, student, peer, and supervisory ratings for resident advisors in seven university dormitory complexes were used in this study. The analyses indicate that both organization and rater differences have significant, independent effects on performance ratings. The findings suggest that organizational differences may restrict the generality of the findings of performance appraisal studies across organizational settings. They also may have a negative impact on the usefulness of any particular performance appraisal form in different settings, and on the ability of managers to accurately interpret and compare performance ratings for individuals in different organizational subunits.

11.
The social skills of 20 second- and sixth-grade students were assessed by 20 trained raters using the Social Skills Test for Children (SST-C). Rater and child characteristics were examined to determine whether differences in social skills ratings were due to the race of the rater or the race of the children being rated or due to the interactive effects of these characteristics, which would suggest racial bias in the ratings procedure. The results showed that the race of the rater did affect some behavioral observations. Black raters gave higher scores than white raters on four behavioral categories: response latency, appropriate assertion, effective assertion, and smiling. White raters gave higher scores for head position and gestures. The results of this study replicated earlier findings of significant differences in social skills ratings due to the race and age of the child being rated. The results also showed modest racial bias effects in that black and white raters scored black and white children differentially on two behavioral categories: overall skill ratings and smiling. These results suggested that most behavioral categories of the SST-C were not systematically affected by racial bias. However, the most subjective rating, overall skill, did evidence racial bias effects. This finding is consistent with previous data showing that subjective ratings may be most affected by racial bias.

12.
Assessments consisting of only a few extended constructed response items (essays) are not usually equated using anchor test designs because there are typically too few essay prompts in each form to allow for meaningful equating. This article explores the idea that output from an automated scoring program designed to measure writing fluency (a common objective of many writing prompts) can be used in place of a more traditional anchor. The linear-logistic equating method used in this article is a variant of the Tucker linear equating method appropriate for the limited score range typical of essays. The procedure is applied to historical data. Although the procedure only results in small improvements over identity equating (not equating prompts), it does produce a viable alternative, and a mechanism for checking that the identity equating is appropriate. This may be particularly useful for measuring rater drift or equating mixed format tests.

13.
Recent developments in computerized scoring via semantic distance have provided automated assessments of verbal creativity. Here, we extend past work, applying computational linguistic approaches to characterize salient features of creative text. We hypothesize that, in addition to semantic diversity, the degree to which a story includes perceptual details, thus transporting the reader to another time and place, would be predictive of creativity. Additionally, we explore the use of generative language models to supplement human data collection and examine the extent to which machine-generated stories can mimic human creativity. We collect 600 short stories from human participants and GPT-3, subsequently randomized and assessed on their creative quality. Results indicate that the presence of perceptual details, in conjunction with semantic diversity, is highly predictive of creativity. These results were replicated in an independent sample of stories (n = 120) generated by GPT-4. We do not observe a significant difference between human and AI-generated stories in terms of creativity ratings, and we also observe positive correlations between human and AI assessments of creativity. Implications and future directions are discussed.

14.
In this research we developed and validated an interactive video assessment of conflict resolution skills. A model of conflict management was used to develop the conflict scenarios and part of the scoring key. Computer assessments of conflict resolution skills and two cognitive abilities were administered to 347 supervisors and job performance ratings were collected from their managers. The conflict skills assessment was found to be significantly related to supervisory ratings of on-the-job performance in managing conflict but to be unrelated to the measures of cognitive ability. In addition, the conflict skills assessment had no adverse impact for women. The implications of these results and directions for future research are discussed.

15.
Retrospective rating scales are widely used for formal assessment of typical performance. Raters who are the most familiar/interactive with ratees are routinely recommended to maximize the quality of ratings. This caveat to use the most familiar/interactive raters fails to distinguish sampling parameters of the observations on which ratings are based that may be important to assessing different classes of behavior. We hypothesized that systematic observational schedules would be of greater importance to ratings of public events than familiarity/interaction, per se, while the caveat would hold for ratings of private events. We used the Psychotic Inpatient Profile (PIP), which provides separate factor scores for ratings of public and private events, to examine these hypotheses in a quasi-experimental study with adult inpatients of mental hospitals. A large multiinstitutional data set provided retrospective PIP ratings by two types of raters. The most familiar/interactive local clinical staff for each client completed the PIP after observing on an ad lib schedule, along with ongoing job duties. Unfamiliar, noninteractive raters completed the PIP for each client after observing on a systematic time-sampling schedule for purposes of coding an entirely different instrument. Data were selected so that each of 189 clients received PIP scores from four raters, reflecting functioning during the same time period based on day-shift observations by one rater of each type and evening-shift observations by one rater of each type. Analyses of variance, consistency/discriminability of ratings, and prediction of social-action outcomes all supported the hypotheses. We discuss alternative strategies that are better for assessing typical performance in most circumstances. 
We also provide recommendations for improving the adequacy of observations for those circumstances in which the standardized retrospective rating scale could be a cost-effective assessment strategy. This study was the basis of a master's thesis at the University of Houston by the senior author under the direction of the junior authors. Richard M. Rozelle served on the examination committee. This study was partially supported by grants to Gordon L. Paul from the National Institute of Mental Health, Public Health Service (MH-15353; MH-25464); the Illinois Department of Mental Health and Developmental Disabilities; the Joyce Foundation; the MacArthur Foundation; the Owsley Foundation; the Cullen Foundation; and the Center for Public Policy of the University of Houston.

16.
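朱宇 (Zhu Yu), 冯瑞龙 (Feng Ruilong), 辛涛 (Xin Tao). 《心理科学》 (Journal of Psychological Science), 2013, 36(2): 479-483
Working from the perspective of generalizability theory, this study collected response and rating data for the writing items of a mock New HSK Level 5 test, estimated the variance components of potential effects such as item type, number of items, number of raters, and rating speed, examined the reliability of New HSK writing scores, and explored ways to improve that reliability. Data analysis based on generalizability theory and optimization (solver-based programming) identified a plan for adjusting the number of items as well as an optimal combination of item type, number of items, and number of raters. The analysis of rating speed is an exploratory theoretical contribution, whereas the other results may be useful for practical decisions aimed at improving the quality of the test.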

17.
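This study examined the parameter recovery and applicability of the graded response multilevel facets model (GR-MLFM) proposed by 康春花 (Kang Chunhua), 孙小坚 (Sun Xiaojian), and 曾平飞 (Zeng Pingfei) (2016) when predictors at both the examinee level and the rater level are included (the full model). The results showed that: (1) the full GR-MLFM is logically and mathematically sound, can be used in scoring constructed-response items, and detects rater effects, the factors that influence them, and the size of those influences well; (2) in the scoring of mathematical problem solving, raters showed two types of rating tendency (leniency and severity effects), but for the great majority of raters neither tendency was pronounced; raters' conscientiousness positively predicted their severity and their self-confidence positively predicted their leniency, whereas emotional stability and rating experience had no significant predictive effect.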

18.
Much of the prior research investigating the influence of cultural values on performance ratings has focused either on conducting cross-national comparisons among raters or using cultural level individualism/collectivism scales to measure the effects of cultural values on performance ratings. Recent research has shown that there is considerable within-country variation in cultural values, i.e., people in one country can be more individualistic or collectivistic in nature. Taking the latter perspective, the present study used Markus and Kitayama's (1991) conceptualization of independent and interdependent self-construals as measures of individual variations in cultural values to investigate within-culture variations in performance ratings. Results suggest that rater self-construal has a significant influence on overall performance evaluations; specifically, raters with a highly interdependent self-construal tend to show a preference for interdependent ratees, whereas raters high on independent self-construal do not show a preference for a specific type of ratee when making overall performance evaluations. Although rater self-construal significantly influenced overall performance evaluations, no such effects were observed for specific dimension ratings. Implications of these results for performance appraisal research and practice are discussed.

19.
This longitudinal study investigates whether developmental changes following 360 degree feedback are predicted by the favourability of ratings received, and moderated by focal individuals' self‐efficacy and perceived importance of feedback. Five developmental criteria are investigated longitudinally: (i) self‐assessments, (ii) line managers' ratings, (iii) amount of developmental activity, (iv) global self‐efficacy and (v) self‐efficacy for development. Feedback ratings from certain rater groups predicted changes in ratings, but not changes in self‐efficacy or amount of developmental activity. Self‐efficacy significantly moderated the feedback–performance association for certain rater groups, but feedback importance did not. Contrary to expectations, the focal individual's initial self‐assessment predicted changes in self‐efficacy, over the favourability of ratings received. The implications of these findings for organizations using 360 degree feedback for developmental purposes are discussed.

20.
In performance appraisals, some assessors are substantially more lenient than others. Research on this effect in appraisals involving communication and interaction between raters and ratees after the performance evaluation has taken place indicates that it may be at least partly caused by individual differences in assessor personality. However, little is known about the impact or causes of rater severity versus leniency in situations in which there is little or no contact between raters and ratees after the performance evaluation. In Study 1 (N = 174) the strength of the severity–leniency effect in this ‘no‐contact’ context is estimated and found to be similar to that reported for ‘with‐contact’ appraisals. No evidence of an association between assessor personality and assessor severity (vs. leniency) is found in the ‘no‐contact’ context. In Study 2 (N = 54) there is no evidence of an association between the fluid cognitive ability of assessors and the severity of their ratings in a no‐contact context. It is concluded that the severity versus leniency effect probably has a considerable impact on performance ratings in ‘no‐contact’ appraisal settings, but that neither rater personality nor rater cognitive ability appears to play a significant role in this.
