Similar literature
 20 similar documents found.
1.
Statistically based banding is often considered a viable method for minimizing adverse impact in test‐based employment decisions. By utilizing the standard error of the difference (SED), scores are equated based on the assumption that there is substantial unreliability in any single observed score. However, based on the derivations of Dudek, the formula commonly used to calculate the standard error of measurement (SEM) – a component that is typically used to calculate the SED – is incorrect. Specifically, utilizing the SEM when calculating the SED produces a band of observed scores around a true score, not a band of true scores around an observed score as would be appropriate for banding. This study compares the differences between banding‐based selection decisions when the appropriate SED formula – which utilizes the standard error of estimate – is and is not applied. Overall, results suggest that utilizing the appropriate formula for calculating the SED produces substantial variations in employment decisions. The potential legal and ethical implications of these discrepancies are discussed.
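
The contrast drawn in this abstract can be illustrated with the textbook classical-test-theory formulas. The sketch below is an assumption, not the study's own computation: the abstract does not state its formulas, so the standard definitions SEM = SD·sqrt(1 − r), SEE = SD·sqrt(r·(1 − r)), and band width = z·SE·sqrt(2) are used, and the function names and example values are hypothetical.

```python
import math


def sem(sd: float, rxx: float) -> float:
    """Standard error of measurement: spread of observed scores around a true score."""
    return sd * math.sqrt(1.0 - rxx)


def see(sd: float, rxx: float) -> float:
    """Standard error of estimate: spread of true scores around an observed score."""
    return sd * math.sqrt(rxx * (1.0 - rxx))


def band_width(se: float, z: float = 1.96) -> float:
    """Band width from the standard error of the difference, SED = se * sqrt(2)."""
    return z * se * math.sqrt(2.0)


if __name__ == "__main__":
    sd, rxx = 10.0, 0.80  # illustrative values only (assumption)
    print("SEM-based band width:", round(band_width(sem(sd, rxx)), 2))
    print("SEE-based band width:", round(band_width(see(sd, rxx)), 2))
```

Because r is below 1, the SEE is always smaller than the SEM under these formulas, so the SEE-based band is narrower and fewer scores are treated as equivalent.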

2.
A particular form of test score banding, in which bands are based on the reliability of the test and in which selection within bands takes into account criteria that are likely to enhance workforce diversity, has been proposed as an alternative to the traditional top-down (rank-order) hiring systems, but it has been hotly debated among both scientists and practitioners. In a question-and-answer format, this article presents three different viewpoints (proponents, critics, and neutral observers) on the scientific, legal, and practical issues. The article also attempts to seek some consensus among experts on this controversial procedure.

3.
The Leiter-3 is a nonverbal assessment that evaluates cognitive abilities and has been adapted for use in Scandinavia. Generalizability of United States-based normative scoring for use with the Scandinavian population was evaluated. Leiter-3 scores from a sample of Scandinavian students were compared with scores obtained from the Leiter-3 standardization sample, controlling for confounding variables, across ages, using mixed-methods analysis. A Scandinavian-population-based sample was created from Leiter-3 standardization data and norms were constructed and were used to generate standardized scores from the sample data. Results suggest that overall the Scandinavian test-takers score higher than American test-takers, but that differences between groups were minimized when controlling for factors that may influence cognitive performance. Creating Scandinavian based scores was not effective at reducing gaps in performance, suggesting that differences in performance between the different populations may be attributable to factors other than those typically controlled for when constructing standardized tests. Implications of these results and recommendations for Leiter-3 adaptation are reviewed.

4.
Federal and state court cases were reviewed to determine the legality of banding. Banding specifies a range of test scores that are considered equivalent for selection purposes, which allows the use of other job-related or diversity factors to select among candidates within a particular band. Although the Supreme Court has not ruled on the legality of banding, state, district, and appellate courts have upheld different types of banding (e.g., fixed, sliding, random) under the 14th Amendment, Title VII of the Civil Rights Act of 1964, and the Civil Rights Act of 1991. However, the case review indicated that banding is less likely to survive legal scrutiny when minority preference is the only factor used to choose among candidates within a band. Implications for organizations using or considering banding are discussed.

5.
HORST P. Psychometrika, 1948, 13(3): 125-134
A battery of pencil-and-paper tests is commonly used for predicting a single criterion. If the score on each test is the number of correct answers, the composite battery score would normally be the sum of the weighted test scores, where the weights are the raw score regression weights. Knowing the reliability of each test, it is possible to alter the lengths of the tests in a manner such that the weights will all be equal. The composite battery score would then simply be the total number of items answered correctly and scoring would be greatly simplified. Such simplification is particularly desirable where the volume of testing is large. Section I of the article outlines the procedure for altering the lengths of the tests, and Section II gives a proof of the method.

6.
In the task-switching paradigm, the latency switch-cost score (the difference in mean reaction time between switch and nonswitch trials) is the traditional measure of task-switching ability. However, this score does not reflect accuracy, where switch costs may also emerge. In two experiments that varied in response deadlines (unlimited vs. limited time), we evaluated the measurement properties of two traditional switch-cost scoring methods (the latency switch-cost score and the accuracy switch-cost score) and three alternatives (a rate residual score, a bin score, and an inverse efficiency score). Scores from the rate residual, bin score, and inverse efficiency methods had comparable reliability for latency switch-cost scores without response deadlines but were more reliable than latency switch-cost scores when higher error rates were induced with a response deadline. All three alternative scoring methods appropriately accounted for differences in accuracy switch costs when higher error rates were induced, whereas pure latency switch-cost scores did not. Critically, only the rate residual and bin score methods were more valid indicators of task-switching ability; they demonstrated stronger relationships with performance on an independent measure of executive functioning (the antisaccade analogue task), and they allowed the detection of larger effect sizes when examining within-task congruency effects. All of the three alternative scoring methods provide researchers with a better measure of task-switching ability than do traditional scoring methods, because they each simultaneously account for latency and accuracy costs. Overall, the three alternative scoring methods were all superior to the traditional latency switch-cost scoring method, but the strongest methods were the rate residual and bin score methods.
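
To make the scoring methods named in this abstract concrete, here is a minimal sketch of the two traditional scores and of the inverse efficiency score. It rests on assumptions: the abstract defines only the latency switch cost, so the accuracy switch cost is taken as the difference in proportion correct, and inverse efficiency uses its common definition (mean correct-trial reaction time divided by proportion correct). The rate residual and bin scores are not sketched because the abstract does not describe them. All function names and data are hypothetical.

```python
from statistics import mean


def latency_switch_cost(switch_rts, nonswitch_rts):
    """Mean RT on switch trials minus mean RT on nonswitch trials."""
    return mean(switch_rts) - mean(nonswitch_rts)


def accuracy_switch_cost(switch_correct, nonswitch_correct):
    """Proportion correct on nonswitch trials minus proportion correct on switch trials."""
    return mean(nonswitch_correct) - mean(switch_correct)


def inverse_efficiency(rts, correct):
    """Mean RT of correct trials divided by proportion correct (assumed definition)."""
    correct_rts = [rt for rt, ok in zip(rts, correct) if ok]
    return mean(correct_rts) / mean(correct)


if __name__ == "__main__":
    # Illustrative trial data in milliseconds; 1.0 = correct, 0.0 = error.
    switch_rts, switch_ok = [700.0, 800.0, 900.0], [1.0, 1.0, 0.0]
    nonswitch_rts, nonswitch_ok = [500.0, 600.0, 700.0], [1.0, 1.0, 1.0]
    print(latency_switch_cost(switch_rts, nonswitch_rts))
    print(accuracy_switch_cost(switch_ok, nonswitch_ok))
    print(inverse_efficiency(switch_rts, switch_ok)
          - inverse_efficiency(nonswitch_rts, nonswitch_ok))
```

The inverse efficiency difference grows when either latency or error rate rises on switch trials, which is the sense in which such a score reflects both costs at once.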

7.
Although difference scores are widely used in classifying children as learning-disabled, their psychometric properties are often not well understood. Such scores generally contain more error than single test scores. Reliability and standard error of measurement figures for several combinations of ability and achievement measures are presented. The rates and types of errors that occur when such scores are used to classify children as learning-disabled are discussed. Three recommendations for using difference scores are given: (a) combinations of ability and achievement tests that yield difference score reliabilities higher than .80 should be used when classifying children; (b) scores should be reported as a band of scores (± one standard error of measurement) to inform decision-makers regarding the amount of error estimated to be in the score; and (c) the criterion score for classifying the learning disabled should be set after consideration of the rate and types of errors likely to occur.

8.
Multiple‐choice tests are frequently used in personnel selection contexts to measure knowledge and abilities. Option weighting is an alternative multiple‐choice scoring procedure that awards partial credit for incomplete knowledge reflected in applicants’ distractor choices. We investigated whether option weights should be based on expert judgment or on empirical data when trying to outperform conventional number‐right scoring in terms of reliability and validity. To obtain generalizable results, we used repeated random sub‐sampling validation and found that empirical option weighting, but not expert option weighting, increased the reliability of a knowledge test. Neither option weighting procedure improved test validity. We recommend improving the reliability of existing ability and knowledge tests used for personnel selection by computing and publishing empirical option weights.

9.
Reliability generalization (RG) is a meta-analytic technique that allows for the systematic examination of variation in score reliability for different samples of test takers; this procedure is based on the recognition that reliability is not a stable property of a test but is sample dependent. As a demonstration of an RG analysis, I obtained 63 reliability coefficients for each of the MMPI-2 (Butcher et al., 2001) Personality Psychopathology 5 (Harkness, McNulty, & Ben-Porath, 1995) scales. The overall variability of alpha coefficients supports the argument that reliability is sample dependent and underscores the need for researchers to calculate reliability estimates based on their research samples rather than simply citing published alpha coefficients as evidence of score reliability. I observed statistically significant mean reliability differences for scores across the 5 scales, with the highest level of reliability observed for scores on the measure of Negative Emotionality and the lowest levels of reliability observed for scores on the measures of Aggression and Disconstraint. There was no evidence that the sex-composition of a sample was systematically related to score reliability, and there were no statistically significant differences in reliability between scores obtained with the English version of the test and those obtained with translated forms. However, reliability was consistently lower for scores on some scales when the data were obtained in nonclinical settings as opposed to clinical ones. Sample size was not significantly correlated with reliability estimates. RG methods have the potential for deepening the level of understanding about the role of reliability in the evaluation and use of personality tests.

10.
Posner’s attention network model and Bundesen’s theory of visual attention (TVA) are two influential accounts of attention. Each model has led to the development of a test method: the attention network test (ANT) and TVA-based assessment, respectively. Both tests have been widely used to investigate attentional function in normal and clinical populations. Here we report on the first direct comparison of the ANT to TVA-based assessment. A group of 68 young healthy participants were tested in three consecutive sessions that each contained standard versions of the two tests. The parameters derived from TVA-based assessment had better internal reliability and retest reliability than did those of the standard version of the ANT, where only the executive network score reached comparable levels. However, when corrected for differences in test length, the retest reliability of the orienting network score equaled the least reliable TVA parameters. Both tests were susceptible to practice effects, which improved performance for some parameters while leaving others constant. All pairwise correlations between the eight attention parameters measured by the two tests were small and nonsignificant, with one exception: A strong correlation (r = 0.72) was found between two parameters of TVA-based assessment, visual processing speed and the capacity of visual short-term memory. We conclude that TVA-based assessment and the ANT measure complementary aspects of attention, but the scores derived from TVA-based assessment are more reliable.

11.
A simulation was used to explore the effects of variations in the rate at which applicants drop out of selection processes on racial differences in selection outcomes. Archival data were used to simulate a realistic range of selection scenarios in which test score differences between groups and selection ratios varied. The basis for dropping out was manipulated in two separate studies. Study 1 simulated dropout decisions that occurred at random within racial subgroups; in this study, dropout rates of minority versus White candidates were varied. Study 2 examined dropout decisions that occurred as a function of test scores. Results from both studies showed that mean test score differences between White and minority applicants have the largest influence on adverse impact. Interventions designed to reduce the tendency of minority applicants to withdraw from selection are likely to have, at best, small effects on the adverse impact of selection tests.

12.
MODELING THE EFFECTS OF BANDING IN PERSONNEL SELECTION   Total citations: 1, self-citations: 1, citations by others: 0
Selection outcomes under banding are affected by characteristics of the selection system and the applicant pool. This study examined the effects of eight parameters on the proportions hired from higher- and lower-scoring groups: (a) selection ratio; (b) reliability; (c) fixed vs. sliding bands; (d) top-down vs. random within-band selection; (e) preferential vs. nonpreferential selection; (f) mean differences; (g) standard deviation differences; and (h) proportion of applicants from the lower-scoring group. Simulation results were analyzed in a fully-crossed eight-way ANOVA. Higher-order interactions among selection system and applicant pool characteristics had virtually no effect on selection outcomes; the proportion of the applicant pool from the lower-scoring group accounted for nearly half the variance in outcomes. Other important effects are, in order, the effects of standard deviation differences, mean differences, preferential hiring, and the selection ratio. Applicant pool characteristics have considerably more influence on selection outcomes than do selection system characteristics.

13.
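This paper raises the question of how the reliability of difference scores varies, and uses simulated data to analyze how that reliability changes under different conditions. The results indicate: 1. When the reliability coefficients of the two test administrations are equal or similar, the larger the difference between the standard deviations of the two administrations, the higher the reliability of the difference score. 2. When the reliability coefficients of the two administrations differ, as long as one administration has both a higher reliability and a larger standard deviation than the other, the reliability of the difference score is also relatively high. 3. Whatever the relationship between the two reliabilities, the lower the correlation between the two administrations, the higher the reliability of the difference score.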
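
The first and third patterns reported in this abstract can be checked against the classical formula for the reliability of a difference score D = X1 − X2. The sketch below is an assumption: the abstract does not say which formula its simulation used, so the standard classical-test-theory expression with uncorrelated errors is taken, and the function name and numeric values are hypothetical.

```python
def difference_score_reliability(s1, s2, r11, r22, r12):
    """Classical reliability of D = X1 - X2, assuming uncorrelated measurement errors.

    s1, s2   : standard deviations of the two administrations
    r11, r22 : reliabilities of the two administrations
    r12      : correlation between the two administrations
    """
    true_var = s1 ** 2 * r11 + s2 ** 2 * r22 - 2.0 * r12 * s1 * s2
    observed_var = s1 ** 2 + s2 ** 2 - 2.0 * r12 * s1 * s2
    return true_var / observed_var


if __name__ == "__main__":
    # Equal reliabilities: a larger gap between standard deviations raises reliability.
    print(difference_score_reliability(1.0, 1.0, 0.8, 0.8, 0.5))
    print(difference_score_reliability(1.0, 2.0, 0.8, 0.8, 0.5))
    # A lower correlation between administrations also raises reliability.
    print(difference_score_reliability(1.0, 1.0, 0.8, 0.8, 0.2))
```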

14.
《人类行为》2013,26(3):203-214
Murphy and Myors (this issue) challenge the conclusion by Schmidt (1991) that banding procedures in personnel selection are fatally flawed logically. In this article, we show that their challenge is based on a misunderstanding of the nature of Schmidt's argument. We show that when that argument is properly understood, it does in fact require the conclusion that there is a fatal logical inconsistency between the statistical rationale underlying banding and the operational procedures for banding. The first part of this article demonstrates this for banding in general; the second part elucidates this inconsistency for a particular form of banding that is currently quite popular: the sliding band with minority preference. We show that the comparisons drawn by Murphy and Myors between banding and scoring procedures such as stanine scores and college grades are based on their misunderstanding of the Schmidt argument and are therefore conceptually and mathematically inappropriate and irrelevant.

15.
In an effort to duplicate high interrater reliability coefficients reported in the use of Epley and Ricks' (1963) time orientation scoring system with the Thematic Apperception Test (TAT), two pairs of judges and two different training procedures were employed. Reliability coefficients considerably lower than those quoted by other researchers were found. One method of using the system was to have judges discuss scoring differences during training and at various times during a research project until perfect agreement was reached. When used as an adjunct with periodic assessment of reliability as judges scored a large number of stories, reliability coefficients within a range acceptable for research purposes were obtained. This procedure is presented with correlational evidence for the presence of the time factor that the scoring system purports to measure.

16.
In many educational tests which involve constructed responses, a traditional test score is obtained by adding together item scores obtained through holistic scoring by trained human raters. For example, this practice was used until 2008 in the case of GRE® General Analytical Writing and until 2009 in the case of TOEFL® iBT Writing. With use of natural language processing, it is possible to obtain additional information concerning item responses from computer programs such as e‐rater®. In addition, available information relevant to examinee performance may include scores on related tests. We suggest application of standard results from classical test theory to the available data to obtain best linear predictors of true traditional test scores. In performing such analysis, we require estimation of variances and covariances of measurement errors, a task which can be quite difficult in the case of tests with limited numbers of items and with multiple measurements per item. As a consequence, a new estimation method is suggested based on samples of examinees who have taken an assessment more than once. Such samples are typically not random samples of the general population of examinees, so that we apply statistical adjustment methods to obtain the needed estimated variances and covariances of measurement errors. To examine practical implications of the suggested methods of analysis, applications are made to GRE General Analytical Writing and TOEFL iBT Writing. Results obtained indicate that substantial improvements are possible both in terms of reliability of scoring and in terms of assessment reliability.

17.
Many questions in the social sciences reduce to a comparison of mean values across groups in a classical analysis of variance F test. Often the original data may come from a set of items in a questionnaire or personality inventory. When this occurs, some sort of data reduction, combining of items, or scaling procedure is first performed before the hypothesis of no difference in mean values across groups can be tested. In many cases, this problem causes undue concern to a researcher because the effect of the scoring procedure on the distribution of F is not clear. To help solve this problem, this study was undertaken to investigate whether the method used to calculate scores has any effect on the magnitude of the F ratio in an analysis of variance, for, if it were shown that no statistical difference existed, then a researcher would have some justification for showing the procedure having minimal messes. On the other hand, if statistical differences were to arise because of the kind of scaling procedure employed, then a researcher would have to be more cautious in his choice. For this empirical investigation, Guttman, Saaotor, and simple sum scores were generated using item responses from a large pool of high school seniors. No difference in scoring method was detected when the F ratios resulting from each of the three scoring methods were analyzed. This suggests that, for certain analyses, a simple sum score may be as effective as scores derived by more complicated methods.

18.
HONESTY TESTING FOR PERSONNEL SELECTION: A REVIEW AND CRITIQUE   Total citations: 1, self-citations: 0, citations by others: 1
Paper and pencil predictors of employee theft are described and studies of validity, reliability, and adverse impact of these tests are examined. Validity studies for 10 tests were grouped into 5 categories: comparisons with polygraph examination results, correlations with admissions of past theft, predictive studies using future job behaviors as criteria, comparisons of shrinkage rates before and after the introduction of a testing program, and comparisons of test scores of groups known to be dishonest with groups representing the general population. While positive correlations were consistently found, a variety of methodological differences between studies were identified which make the direct comparison of test validities suspect. High reliabilities are consistently reported, and test score comparisons by race and sex generally report no differences. Ethical issues in honesty test usage are considered and future research needs are identified.

19.
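Population invariance of equating under different definitions of parallel tests   Total citations: 1, self-citations: 0, citations by others: 1
Population invariance is an important assumption of equating: the equating function should be the same for different subgroups of examinees. This study presents a theoretical analysis and a simulation study of the population invariance of linear equating under different definitions of parallel tests; in the simulation study, the REMSD index was computed under six different weighting schemes. The results show that, for strictly parallel tests, the REMSD index is larger when reliability is lower; that subgroup mean differences and reliability differences have a clear interaction effect on REMSD; and that the REMSD index is largest when the expectation weights are equal and smallest when the score weights are proportional to subgroup size. Finally, the results are discussed, and recommendations are given on the choice of REMSD weights and on further research.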

20.
Many students and applicants take multiple‐choice tests to demonstrate their competence and achievement. When they are unsure, they guess the most likely answer to maximize their score. Despite the impact of guessing on test reliability and individual performance, studies have not examined how patterns of answer sequences in multiple‐choice tests affect guessing. This research presents the test taker's fallacy, which refers to an individual's tendency to expect a different answer to appear for the next question given a run of the same answer choices. The test taker's fallacy exhibits negative recency, similar to the gambler's fallacy. However, extending the sequential judgment literature, the test taker's fallacy shows that negative recency arises even when sequences may or may not be randomly generated. In three studies, including a survey and experiments, the test taker's fallacy is robustly observed. The test taker's fallacy is consistent with the operation of the representativeness heuristic. This research explains what and how test takers guess given a streak of answers and extends judgment under uncertainty to the test‐taking context.
