Similar literature
 Found 20 similar articles (search time: 62 ms)
1.
The present study examined several psychometric issues relevant to the use of a favored technique (the Angoff method) for setting standards in criterion-referenced testing. The research was conducted in a setting that allowed (a) confident identification of expert and non-expert judges, and (b) estimation of "true" scores for the items judged, so that the accuracy of judgments could be examined in addition to their reliability. Results suggested that the expertise of judges does make a difference in producing more accurate and reliable data, underscoring the importance of using true subject matter experts (SMEs) in the judgment process. A rater analysis technique (rater-total correlations) was illustrated, which might prove useful in improving the quality of data obtained using the Angoff method, particularly when there is some question regarding the internal consistency of ratings and the expertise of some of the raters. Finally, a rater accuracy adjustment/calibration technique was examined and proved to be a potentially useful method for maximizing the accuracy of a standard derived using the Angoff method in settings where archival normative test data can be obtained. Other methods that could potentially be used to improve Angoff data were discussed.
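
As a rough illustration of the quantities this abstract mentions, the sketch below computes an Angoff cut score and rater-total correlations from a small ratings matrix. The ratings are hypothetical, and correlating each rater with the mean of the *other* raters (a "corrected" rater-total correlation) is an assumption; the abstract does not say which variant the study used.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def angoff_cut_score(ratings):
    """Sum over items of the mean rating across raters.

    ratings[r][i] is rater r's estimate of the probability that a
    minimally competent examinee answers item i correctly.
    """
    n_items = len(ratings[0])
    return sum(mean(r[i] for r in ratings) for i in range(n_items))

def rater_total_correlations(ratings):
    """Correlate each rater's item ratings with the mean of the other raters.

    Assumption: the rater's own ratings are excluded from the "total".
    """
    result = []
    for k, own in enumerate(ratings):
        others = [r for j, r in enumerate(ratings) if j != k]
        rest = [mean(r[i] for r in others) for i in range(len(own))]
        result.append(pearson(own, rest))
    return result

# Hypothetical data: three raters, four items.
ratings = [
    [0.9, 0.7, 0.5, 0.3],
    [0.8, 0.7, 0.6, 0.4],
    [0.4, 0.6, 0.5, 0.8],  # a rater whose ordering of items disagrees
]
cut = angoff_cut_score(ratings)
rtc = rater_total_correlations(ratings)
```

A rater with a low or negative rater-total correlation, like the third rater here, would be a candidate for closer review before the ratings are pooled.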

2.
While the Angoff (1971) method is a commonly used cut-score method, critics (Berk, 1996; Impara & Plake, 1997) argue that it places excessive cognitive demands on raters. In response to these criticisms, a number of modifications to the method have been proposed. Suggested modifications include using an iterative rating process, presenting judges with normative data about item performance, revising the rating judgment into a Yes/No decision, assigning relative weights to dimensions within a test, and using item response theory in setting cut scores. In this study, subject matter expert raters were provided with a 'difficulty anchored' rating scale to use while making Angoff ratings; this scale can be viewed as a variation of the Angoff normative data modification. The rating scale presented test items having known p-values as anchors, and served as a simple means of providing normative information to guide the Angoff rating process. Results are discussed regarding the reliability of the mean Angoff rating (.73) and the correlation of mean Angoff ratings with item difficulty (observed r ranges from .65 to .73).

3.
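常蕤 (Chang Rui). 心理学探新 (Psychological Exploration), 2008, 28(4): 76-79
Standard-setting methods have developed rapidly over the past two decades, and the Angoff method is the most widely used among them. Building on existing methods, this article proposes an improved Angoff method based on the Rasch model. The method uses the Rasch model to estimate the location of the typical student and removes inconsistent judges using mean-square residuals and related statistics. The article concludes with a detailed account of how the method was applied in Hong Kong's Territory-wide System Assessment.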
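
The abstract does not give the article's exact procedure, so the following is only a minimal sketch of the standard Rasch machinery such a method could build on: the Rasch response probability, the expected raw score at a given ability, and a bisection search for the ability whose expected score matches a judge's summed Angoff ratings. The function names and the item difficulties are hypothetical.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct response at ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def expected_score(theta, difficulties):
    """Expected raw score: sum of Rasch probabilities over items."""
    return sum(rasch_p(theta, b) for b in difficulties)

def theta_for_score(target, difficulties, lo=-6.0, hi=6.0, tol=1e-8):
    """Find theta whose expected score equals target.

    Bisection works because expected_score is strictly increasing in theta.
    The target must lie between expected_score(lo) and expected_score(hi).
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical item difficulties (in logits) and a judge's summed ratings.
difficulties = [-1.0, 0.0, 1.0]
judge_sum = 1.5
theta_hat = theta_for_score(judge_sum, difficulties)
```

With these symmetric difficulties, a summed rating of 1.5 corresponds to an ability of about 0 logits.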

4.
Recent meta-analytic research has demonstrated that structured interviewing has acceptable validity and reliability. While the emphasis has been on refining psychometric properties, there is sufficient evidence to suggest a discrepancy between the manner in which interviewing systems should be used and how they are actually used. The present study examined the use of a commercially available structured interviewing system based on past behaviour. A total of 112 candidates were interviewed on two separate occasions by 28 interviewers. Inter-rater reliability was 0.55. The system required the derivation of a consensus score, which was found not to differ significantly from the arithmetic mean of the original scores, suggesting the process was not undertaken as required. Follow-up discussions with interviewers revealed three main areas of misunderstanding: lack of role clarity, different interpretations of the job specification, and inconsistent use of the rating system. Data also suggested that interviewers were inexperienced. A critical finding involved female interviewers making assumptions about female applicants' motivations and suitability for the position. This finding was explained by employing 'person-in-job' prototypes. The data support the conclusion that although structured interviews may have appropriate psychometric properties, the application of the system is critical.

5.
Maintaining a stable score scale over time is critical for all standardized educational assessments. Traditional quality control tools and approaches for assessing scale drift either require special equating designs, or may be too time-consuming to be considered on a regular basis with an operational test that has a short time window between an administration and its score reporting. Thus, the traditional methods are not sufficient to catch unusual testing outcomes in a timely manner. This paper presents a new approach for score monitoring and assessment of scale drift. It involves quality control charts, model-based approaches, and time series techniques to accommodate the following needs of monitoring scale scores: continuous monitoring, adjustment of customary variations, identification of abrupt shifts, and assessment of autocorrelation. Performance of the methodologies is evaluated using manipulated data based on real responses from 71 administrations of a large-scale high-stakes language assessment.
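
To make the idea of a quality control chart concrete, here is a minimal sketch of a basic Shewhart-type check: administrations whose mean scale score falls outside the baseline mean plus or minus k standard deviations are flagged. This is only the simplest chart of this family, not the paper's combined approach, and the scores and the choice of k = 3 are assumptions for illustration.

```python
from statistics import mean, stdev

def shewhart_flags(baseline, new_means, k=3.0):
    """Flag new administration means outside center +/- k * SD of the baseline."""
    center = mean(baseline)
    sd = stdev(baseline)  # sample standard deviation of baseline means
    lower, upper = center - k * sd, center + k * sd
    return [not (lower <= m <= upper) for m in new_means]

# Hypothetical mean scale scores from past and new administrations.
baseline = [500.2, 499.1, 501.0, 500.5, 498.9, 500.3]
flags = shewhart_flags(baseline, [500.0, 504.9, 499.5])
```

A plain Shewhart rule reacts to abrupt shifts but ignores autocorrelation and seasonal variation, which is presumably why the paper pairs charts with model-based and time series techniques.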

6.
Methods for scoring the response options in situational judgment tests (SJTs) used in personnel selection were compared by creating different keys for a single SJT. The potential benefits of an innovative method combining existing methods were also examined. The results, based on a sample of 1,194 candidates, point to some interesting differences between scoring methods. First, the innovative method produced the lowest mean, near 60%. Second, the single-best-answer method produced the largest variance. Third, the curve of the rank-ordering method was the closest to a normal distribution. Finally, evidence suggests that the best-and-worst-answer method and the innovative method provide the best results regarding construct validity. In sum, although no clear conclusion could be drawn about which methods should be preferred to score SJTs, results indicate that the new method could prove to be very interesting.

7.
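Abstract: The Angoff method was used to define assessment criteria for adolescents' problematic mobile social networking use. The results showed that affirmative answers to 8 or more items on the screening questionnaire can be classified as problematic mobile social networking use. In the empirical test of discriminant validity, the detection rate was 12% among normal users and 91.1% among problematic users. The Angoff-based assessment tool has good psychometric properties and can be used to screen for and assess problematic mobile social networking use among adolescents.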

8.
Many psychophysical tasks in current use render nonmonotonic psychometric functions; these include the oddball task, the temporal generalization task, the binary synchrony judgment task, and other forms of the same-different task. Other tasks allow for ternary responses and render three psychometric functions, one of which is also nonmonotonic, like the ternary synchrony judgment task or the unforced choice task. In all of these cases, data are usually collected with the inefficient method of constant stimuli (MOCS), because extant adaptive methods are only applicable when the psychometric function is monotonic. This article develops stimulus placement criteria for adaptive methods designed for use with nonmonotonic psychometric functions or with ternary tasks. The methods are transformations of conventional up-down rules. Simulations under three alternative psychophysical tasks prove the validity of these methods, their superiority to MOCS, and the accuracy with which they recover direct estimates of the parameters determining the psychometric functions, as well as estimates of derived quantities such as the point of subjective equality or the difference limen. Practical recommendations and worked-out examples are provided to illustrate how to use these adaptive methods in empirical research.

9.
This paper examines psychometric properties of scores derived from calibration curves (overconfidence, calibration, resolution, and slope) and an analogue of overconfidence that is based on a posttest estimate of the proportion of correctly solved items. Four tests from the theory of fluid and crystallized intelligence were used, and two of these tests employed both sequential and simultaneous methods of item presentation. The results indicate that the overconfidence score not only has the highest reliability, but is the only score with a reliability normally considered adequate for use in individual differences research. There is some, albeit weak, difference in subjects' level of overconfidence between sequential and simultaneous methods of item presentation. Correlational evidence confirms our previous findings that overconfidence scores from perceptual and 'knowledge' tasks define the same factor. In agreement with the results of Gigerenzer, Hoffrage and Kleinbölting (1991), subjects' posttest estimates of their performance showed lower levels of overconfidence than did the traditional measures based on subjects' confidence judgment responses to individual items. Also, after controlling for actual test performance, the posttest performance estimates and average confidence ratings were only slightly positively correlated, suggesting that different psychological processes may underlie these two measures. Finally, our results suggest that average confidence over all items in the test may be a more useful measure in individual differences research than scores derived from calibration curves.
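
For readers unfamiliar with the overconfidence score, it is conventionally defined as mean confidence minus the proportion of items answered correctly. The sketch below computes that score and a posttest analogue based on a single after-the-test estimate of the number correct. The function names and the data are hypothetical, and the paper's exact scoring rules may differ.

```python
def overconfidence(confidences, correct):
    """Mean item confidence (0..1) minus proportion of items correct."""
    return sum(confidences) / len(confidences) - sum(correct) / len(correct)

def posttest_overconfidence(estimated_n_correct, correct):
    """Posttest analogue: estimated proportion correct minus actual proportion."""
    n = len(correct)
    return estimated_n_correct / n - sum(correct) / n

# Hypothetical subject: four items, per-item confidence and correctness.
confidences = [0.9, 0.8, 0.6, 0.7]
correct = [1, 0, 1, 0]
oc_items = overconfidence(confidences, correct)    # item-based score
oc_post = posttest_overconfidence(2, correct)      # subject guessed "2 of 4"
```

Positive values indicate overconfidence and negative values underconfidence. In this made-up case the item-based score is positive while the posttest score is zero, the same direction of difference the abstract reports.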

10.
The psychometric function relates an observer's performance to an independent variable, usually some physical quantity of a stimulus in a psychophysical task. This paper, together with its companion paper (Wichmann & Hill, 2001), describes an integrated approach to (1) fitting psychometric functions, (2) assessing the goodness of fit, and (3) providing confidence intervals for the function's parameters and other estimates derived from them, for the purposes of hypothesis testing. The present paper deals with the first two topics, describing a constrained maximum-likelihood method of parameter estimation and developing several goodness-of-fit tests. Using Monte Carlo simulations, we deal with two specific difficulties that arise when fitting functions to psychophysical data. First, we note that human observers are prone to stimulus-independent errors (or lapses). We show that failure to account for this can lead to serious biases in estimates of the psychometric function's parameters and illustrate how the problem may be overcome. Second, we note that psychophysical data sets are usually rather small by the standards required by most of the commonly applied statistical tests. We demonstrate the potential errors of applying traditional χ² methods to psychophysical data and advocate use of Monte Carlo resampling techniques that do not rely on asymptotic theory. We have made available the software to implement our methods.
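
The role of the lapse rate can be sketched with the usual four-parameter form, psi(x) = gamma + (1 - gamma - lambda) * F(x; alpha, beta), where gamma is the guess rate and lambda the lapse rate. The code below is not the authors' software: it assumes a logistic F, uses hypothetical data, and simply compares the binomial log-likelihood with and without a lapse parameter at fixed alpha and beta rather than performing a constrained fit.

```python
import math

def logistic(x, alpha, beta):
    """Logistic core function F with location alpha and scale beta."""
    return 1.0 / (1.0 + math.exp(-(x - alpha) / beta))

def psi(x, alpha, beta, gamma, lam):
    """Psychometric function with guess rate gamma and lapse rate lam."""
    return gamma + (1.0 - gamma - lam) * logistic(x, alpha, beta)

def log_likelihood(data, alpha, beta, gamma, lam):
    """Binomial log-likelihood (constant binomial coefficients omitted).

    data is a list of (stimulus level, number correct, number of trials).
    """
    ll = 0.0
    for x, k, n in data:
        p = psi(x, alpha, beta, gamma, lam)
        ll += k * math.log(p) + (n - k) * math.log(1.0 - p)
    return ll

# Hypothetical 2AFC data; the last block has one error at an easy level.
data = [(1, 11, 20), (3, 15, 20), (5, 19, 20), (10, 19, 20)]
ll_no_lapse = log_likelihood(data, alpha=3.0, beta=1.0, gamma=0.5, lam=0.0)
ll_lapse = log_likelihood(data, alpha=3.0, beta=1.0, gamma=0.5, lam=0.05)
```

With lambda fixed at 0, a single error at a very easy stimulus level is heavily penalized, so a fit would have to distort alpha and beta to absorb it. Allowing a small lambda removes that pressure, which is the bias the abstract describes.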

11.
I suggest the main goal of Rorschach validation should be a refined understanding of what each score means. Toward this end, I review general issues in construct validity, hurdles unique to the Rorschach, and general limitations with validation criteria. I then recommend two approaches for improving criteria so they can begin approximating the gold standards that are necessary for a refined understanding of what scores actually measure. The first is a method for improving expert clinical judgment, and the second is a method for aggregating data across diverse judges. Finally, the Rorschach Rating Scale (RRS) is presented as a criterion tool to be used with either of these approaches to validation. The RRS is a fairly comprehensive summary of the constructs thought to be measured by various Rorschach scoring systems. The utility of the RRS for research and training is discussed, as are other practical, theoretical, and psychometric issues in its application.

12.
I suggest the main goal of Rorschach validation should be a refined understanding of what each score means. Toward this end, I review general issues in construct validity, hurdles unique to the Rorschach, and general limitations with validation criteria. I then recommend two approaches for improving criteria so they can begin approximating the gold standards that are necessary for a refined understanding of what scores actually measure. The first is a method for improving expert clinical judgment, and the second is a method for aggregating data across diverse judges. Finally, the Rorschach Rating Scale (RRS) is presented as a criterion tool to be used with either of these approaches to validation. The RRS is a fairly comprehensive summary of the constructs thought to be measured by various Rorschach scoring systems. The utility of the RRS for research and training is discussed, as are other practical, theoretical, and psychometric issues in its application.

13.
The presence of sexual arousal to children or a sexual preference for children is commonly hypothesized to be related to child molesting. Sexual arousal and sexual preference do not appear to be accurately assessed by traditional assessment methods such as the clinical interview and traditional personality testing or by projective testing (Earls, 1992). Penile tumescence measurement is an increasingly utilized method for assessing sexual arousal and preference in child molesters. The published literature concerning the psychometric properties of this technology as used with child molesters is critically reviewed. Basic questions concerning the sexual preference hypothesis, the criterion problem, the lack of procedural standardization, the kind of test that penile tumescence measurement exemplifies, and potentially problematic inferences involved in penile tumescence assessment are examined. There is evidence of test-retest and internal consistency reliabilities for certain penile tumescence measurement procedures. While there are a significant number of studies providing evidence that these techniques can accurately distinguish child abusers from nonoffenders, many are plagued by methodological problems. Suggestions for future research are given. The authors would like to thank Henry Adams, Christopher Earls, Steven R. Gold, D. Richard Laws, Barry M. Maletzky, and several anonymous reviewers for their comments on earlier versions of this paper.

14.
A method is presented for converting the scores on one form of a test to those on another form of the same test. The method is particularly applicable to the case where each form has been administered to a different group and the only link between the two forms is a subset of items common to both. The proposed method, called the item method of conversion, has been applied to several tests for which other methods of conversion are available for comparison. The necessary data are limited to tests for which the total score is the criterion for item analyses. The method gives highly satisfactory results for all the tests to which it has been applied, particularly when the two groups are rather different, in which case the delta method (a different item method) is inappropriate. The authors are only two of a group, including W. H. Angoff, F. M. Lord, and M. K. Schultz, all of whom have made important contributions to this paper.

15.
The use of unproctored internet-based testing (UIT) for employee selection is quite widespread. Although this mode of testing has advantages over onsite testing, researchers and practitioners continue to be concerned about potential malfeasance (e.g., cheating and response distortion) under high-stakes conditions. Therefore, the primary objective of the present study was to investigate the magnitude and extent of high- and low-stakes retest effects on the scores of a UIT speeded cognitive ability test and two UIT personality measures. These data permitted inferences about the magnitude and extent of malfeasant responding. The study objectives were accomplished by implementing two within-subjects studies (Study 1, N = 296; Study 2, N = 318) in which test takers first completed the tests as job applicants (high-stakes) or incumbents (low-stakes) and then as research participants (low-stakes). For the speeded cognitive ability measure, the pattern of test score differences was more consonant with a psychometric practice effect than with a malfeasance explanation. This result is likely due to the speeded nature of the test. For the UIT personality measures, the pattern of higher high-stakes scores compared with lower low-stakes scores is similar to that reported for proctored tests in the extant literature. Thus, our results indicate that UIT administration does not uniquely threaten personality measures: score elevation under high-stakes testing is no higher than that observed for proctored tests in the extant literature.

16.
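This study examined how four item selection indices perform in multidimensional computerized adaptive testing (MCAT), with and without exposure control. The indices were Bayesian D-optimality, the posterior expected Kullback-Leibler method (KLP), the minimized error variance of the linear combination score with equal weight (V1), and the minimized error variance of the composite score with optimized weight (V2). The Restrictive Threshold (RT) and Restrictive Progressive (RPG) methods, developed for item exposure control in cognitive diagnostic CAT, and the Maximum Priority Index (MPI) method from unidimensional CAT were extended to MCAT. The simulation study showed that (1) KLP, D-optimality, and V1 estimated domain scores accurately and recovered ability better than V2; (2) although V1 and V2 improved item pool utilization relative to KLP and D-optimality, all four indices produced uneven item exposure rate distributions; (3) all three exposure control strategies greatly improved the uniformity of item exposure without noticeably reducing measurement precision; and (4) MPI and RPG performed similarly in exposure control, and both outperformed RT.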

17.
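A simulation study comparing two Angoff methods   Total citations: 1; self-citations: 0; citations by others: 1
A simulation experiment compared the accuracy and stability of two Angoff methods for setting cut scores: the probability method and the yes/no method. The results showed that (1) when true ability is below the average difficulty of the test, the probability method overestimates the cut score and the yes/no method underestimates it; conversely, when true ability is above the average difficulty, the probability method underestimates and the yes/no method overestimates; (2) when true ability is close to the average difficulty, the probability method is more accurate than the yes/no method, whereas when true ability is far above or below the average difficulty, the yes/no method is more accurate; and (3) under all experimental conditions, the probability method is more stable than the yes/no method.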
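
The two methods differ only in the rating scale: judges give a probability per item in one and a yes/no judgment in the other, and in both the cut score is the sum of item means across judges. The sketch below shows that computation. The ratings are hypothetical, and deriving a judge's yes/no answer by thresholding their probability at 0.5 is an assumption made purely for illustration. It does not model rating error, so it cannot reproduce the study's accuracy or stability comparisons.

```python
def cut_score(ratings):
    """Sum over items of the mean rating across judges.

    Works for probability ratings (0..1) and yes/no ratings (0 or 1).
    """
    n_judges = len(ratings)
    n_items = len(ratings[0])
    return sum(sum(r[i] for r in ratings) / n_judges for i in range(n_items))

def to_yes_no(prob_ratings, threshold=0.5):
    """Hypothetical judge rule: say 'yes' when the probability exceeds threshold."""
    return [[1 if p > threshold else 0 for p in r] for r in prob_ratings]

# Hypothetical hard test: most items look difficult for the borderline examinee.
prob_ratings = [
    [0.30, 0.40, 0.20, 0.60],
    [0.40, 0.30, 0.30, 0.45],
]
prob_cut = cut_score(prob_ratings)
yes_no_cut = cut_score(to_yes_no(prob_ratings))
```

On this hard item set, thresholding pushes most ratings to 0, so the yes/no cut score falls below the probability-method cut score. That matches the direction the abstract reports for the yes/no method when ability is below test difficulty.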

18.
There is growing interest in organizationally provided or organizationally endorsed coaching. However, little is known about the effects of such coaching on test scores in operational settings. This study reports on an examination of such a program in the context of the use of a situational judgment test (SJT) for medical school admissions. We examine the effects of multiple types of coaching methods on SJT scores and on their construct-related and predictive validities. Results suggest that (1) commercial coaching techniques may not be as effective as previously thought, whereas organizationally provided methods may be more effective, and that (2) the criterion-related validity of the SJT scores is not degraded by the availability of coaching. Generally, this study illustrates that concerns about potential unfairness of coaching can be countered by making effective coaching available to all examinees, in the form of organizationally endorsed coaching.

19.
The psychometric function relates an observer's performance to an independent variable, usually some physical quantity of a stimulus in a psychophysical task. This paper, together with its companion paper (Wichmann & Hill, 2001), describes an integrated approach to (1) fitting psychometric functions, (2) assessing the goodness of fit, and (3) providing confidence intervals for the function's parameters and other estimates derived from them, for the purposes of hypothesis testing. The present paper deals with the first two topics, describing a constrained maximum-likelihood method of parameter estimation and developing several goodness-of-fit tests. Using Monte Carlo simulations, we deal with two specific difficulties that arise when fitting functions to psychophysical data. First, we note that human observers are prone to stimulus-independent errors (or lapses). We show that failure to account for this can lead to serious biases in estimates of the psychometric function's parameters and illustrate how the problem may be overcome. Second, we note that psychophysical data sets are usually rather small by the standards required by most of the commonly applied statistical tests. We demonstrate the potential errors of applying traditional χ² methods to psychophysical data and advocate use of Monte Carlo resampling techniques that do not rely on asymptotic theory. We have made available the software to implement our methods.

20.
This study examined the psychometric quality of the Affect-Balance Scale (ABS) (Bradburn, 1969) using data collected from 292 middle-aged and older adults, living independently. The dimensionality of the scale was examined, the quality of individual items was tested, and the validity of the ABS was studied. Using a tetrachoric correlation matrix with the robust weighted least squares (WLSMV) estimation method of the Mplus program, we found that two moderately correlated (r = -0.37) constructs are needed to adequately account for the pattern of item scores in the ABS. Two of the 10 ABS items were found to be problematic. When raw sum scores were used in analysis, the correlation between the positive-affect and the negative-affect subscales was lower (r = -0.17), indicating that random and nonrandom measurement error masked the relationship between the two. While affect-balance correlated substantially with five criterion well-being measures, the negative-affect subscale (which constitutes half of the ABS) had a similar pattern of correlations, with only slightly lower magnitude. The theoretical construct of 'balance' is also questioned. The 'balance' scoring method (subtracting the negative-affect subscale score from the positive-affect subscale score) nets exactly the same score as does summing scores from both subscales together. Accordingly, the summed scores have the very same correlations with other variables as do the balance scores.
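
The equivalence of balance and summed scores can be checked with a short sketch. It assumes binary (0/1) items, five per subscale, and that "summing both subscales" means summing after reverse-scoring the negative-affect items. Under that assumed reading, the summed score equals the balance score plus a constant, which is why its correlations with other variables are identical. The responses are hypothetical.

```python
def balance_score(pos_items, neg_items):
    """Positive-affect count minus negative-affect count."""
    return sum(pos_items) - sum(neg_items)

def summed_score(pos_items, neg_items):
    """Sum of all items after reverse-scoring negative items (1 -> 0, 0 -> 1).

    Assumption: this is what 'summing scores from both subscales' refers to.
    """
    return sum(pos_items) + sum(1 - x for x in neg_items)

# Hypothetical respondents: (positive-affect items, negative-affect items).
respondents = [
    ([1, 1, 0, 1, 0], [0, 1, 0, 0, 0]),
    ([0, 0, 1, 0, 0], [1, 1, 1, 0, 1]),
    ([1, 1, 1, 1, 1], [0, 0, 0, 0, 0]),
]
offsets = [summed_score(p, n) - balance_score(p, n) for p, n in respondents]
```

Every respondent's summed score exceeds the balance score by the same amount (the number of negative-affect items), so the two scorings are linearly equivalent.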
