Similar Literature
20 similar records found.
1.
2.
Test validity is predicated on there being a lack of bias in tasks, items, or test content. It is well-known that factors such as test candidates' mother tongue, life experiences, and socialization practices of the wider community may serve to inject subtle interactions between individuals' background and the test content. When the gender of the test candidate interacts further with these factors, the potential for item bias to influence test performances grows. A dilemma faced by test designers concerns how they can proactively screen test content for possible sources of bias. Conventional practices in many contexts rely on the subjective opinion of review panels in detecting sensitive topical content and potentially biased material and items. In the last two decades this practice has been rivaled by the increased availability of item bias diagnostic software. Few studies have compared the relative accuracy and cost utility of the two approaches in the domain of language assessment. This study makes just that comparison. A 4-passage, 20-item reading comprehension test was given to a stratified sample of 825 high school students and college undergraduates at 5 Japanese institutions. The sampling included a focal group of 468 female students compared to a reference group of 357 male English as a foreign language (EFL) learners. The test passages and items were also given to a panel of 97 in-service and preservice EFL teachers for subjective ratings of potential gender bias. The results of the actual item responses were then empirically checked for evidence of differential item functioning using Simultaneous Item Bias analysis, the Mantel-Haenszel Delta method, and logistic regression. Concordance analyses of the subjective and objective methods suggest that subjective screening of bias overestimates the extent of actual item bias. Implications for cost-effective approaches to item bias detection are discussed.
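As a concrete illustration of the empirical screening step described above, the following is a minimal sketch (not the authors' code, and with an assumed data layout) of the Mantel-Haenszel Delta statistic for a single dichotomous item, stratifying examinees by their score on the remaining items:

    import numpy as np

    def mantel_haenszel_delta(item, group, rest):
        """Common odds ratio across score strata, expressed on the ETS delta scale.

        item  : 0/1 responses to the studied item
        group : 0 = reference group, 1 = focal group
        rest  : total score on the remaining items (matching variable)
        """
        num, den = 0.0, 0.0
        for k in np.unique(rest):                        # one 2x2 table per score level
            m = rest == k
            a = np.sum((group[m] == 0) & (item[m] == 1)) # reference, correct
            b = np.sum((group[m] == 0) & (item[m] == 0)) # reference, incorrect
            c = np.sum((group[m] == 1) & (item[m] == 1)) # focal, correct
            d = np.sum((group[m] == 1) & (item[m] == 0)) # focal, incorrect
            n = a + b + c + d
            if n == 0:
                continue
            num += a * d / n
            den += b * c / n
        alpha_mh = num / den                             # MH common odds ratio
        return -2.35 * np.log(alpha_mh)                  # MH D-DIF (delta metric)

On the ETS delta scale, values near zero indicate negligible DIF, and absolute values of roughly 1.5 or more are commonly treated as substantial; the exact flagging rules used in the study are not stated in the abstract.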

3.
A pool of 235 figural reasoning test items was written. Using an anchor-test equating design, with 72 items from the Combined Raven's Test serving as anchor items, the items were administered to 1,733 males ranging in ability level from junior high school through university. The response data were analyzed with BILOG-MG 3.0 (marginal maximum likelihood estimation) under the three-parameter logistic model. After removing items with poor model-data fit and items whose maximum information was below 0.3, a final bank of 181 items was established. The item bank can be used to screen out conscription candidates of lower intellectual ability.
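To make the retention rule explicit: under the three-parameter logistic model the item information function can be evaluated over a grid of ability values, and an item kept only if its maximum exceeds 0.3. The sketch below uses hypothetical item parameters, not the calibrated values from the study:

    import numpy as np

    def p3pl(theta, a, b, c, D=1.7):
        """Three-parameter logistic item response function."""
        return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

    def info3pl(theta, a, b, c, D=1.7):
        """Item information function for the 3PL model."""
        P = p3pl(theta, a, b, c, D)
        Q = 1.0 - P
        return (D * a) ** 2 * (Q / P) * ((P - c) / (1.0 - c)) ** 2

    theta = np.linspace(-4, 4, 401)
    a, b, c = 0.9, 0.5, 0.2                      # illustrative parameters for one item
    keep = info3pl(theta, a, b, c).max() >= 0.3  # the abstract's retention rule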

4.
Effects of the testing situation on item responding: cause for concern
The effects of faking on personality test scores have been studied previously by comparing (a) experimental groups instructed to fake or answer honestly, (b) subgroups created from a single sample of applicants or nonapplicants by using impression management scores, and (c) job applicants and nonapplicants. In this investigation, the latter 2 methods were used to study the effects of faking on the functioning of the items and scales of the Sixteen Personality Factor Questionnaire. A variety of item response theory methods were used to detect differential item/test functioning, interpreted as evidence of faking. The presence of differential item/test functioning across testing situations suggests that faking adversely affects the construct validity of personality scales and that it is problematic to study faking by comparing groups defined by impression management scores.

5.
This article attempts to present emotioncy as a potential source of test bias to inform the analysis of test item performance. Emotioncy is defined as a hierarchy, ranging from exvolvement (auditory, visual, and kinesthetic) to involvement (inner and arch), to emphasize the emotions evoked by the senses. This study hypothesizes that when individuals have high levels of emotioncy for specific words, their test performance may systematically change, resulting in test bias. To this end, 355 individuals were asked to take a 40-item vocabulary test along with the emotioncy scale. A mixed Rasch model was employed to flag items exhibiting differential item functioning. Results illustrated that the test takers with high emotioncy toward specific words outperformed those in the low-emotioncy group, characterizing emotioncy as a potential source of test bias.

6.
7.
Previous studies have concluded that cognitive ability tests are not predictively biased against Hispanic American job applicants because test scores generally overpredict, rather than underpredict, their job performance. However, we highlight two important shortcomings of these past studies and use meta-analytic and computation modeling techniques to address these two shortcomings. In Study 1, an updated meta-analysis of the Hispanic–White mean difference (d-value) on job performance was carried out. In Study 2, computation modeling was used to correct the Study 1 d-values for indirect range restriction and combine them with other meta-analytic parameters relevant to predictive bias to determine how often cognitive ability test scores underpredict Hispanic applicants’ job performance. Hispanic applicants’ job performance was underpredicted by a small to moderate amount in most conditions of the computation model. In contrast to previous studies, this suggests cognitive ability tests can be expected to exhibit predictive bias against Hispanic applicants much of the time. However, some conditions did not exhibit underprediction, highlighting that predictive bias depends on various selection system parameters, such as the criterion-related validity of cognitive ability tests and other predictors used in selection. Regardless, our results challenge “lack of predictive bias” as a rationale for supporting test use.
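The over/underprediction logic the study relies on can be sketched as follows: fit a single regression of job performance on test score to the pooled sample, then inspect the focal group's mean residual; a positive mean residual means the common line underpredicts that group's performance. The data below are synthetic and purely illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    n_r, n_f = 600, 200                                # reference and focal group sizes
    x_r = rng.normal(0.0, 1.0, n_r)                    # test scores, reference group
    x_f = rng.normal(-0.5, 1.0, n_f)                   # focal group with lower mean score
    y_r = 0.5 * x_r + rng.normal(0, 1, n_r)            # job performance, reference group
    y_f = 0.5 * x_f + 0.2 + rng.normal(0, 1, n_f)      # same slope, higher intercept

    x = np.concatenate([x_r, x_f])
    y = np.concatenate([y_r, y_f])
    slope, intercept = np.polyfit(x, y, 1)             # common (pooled) regression line

    resid_f = y_f - (intercept + slope * x_f)
    print(resid_f.mean())  # positive mean residual => focal performance is underpredicted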

8.
The speeded performance on simple mental addition problems of 6- and 7-year-old children with and without mild mental retardation is modeled from a person perspective and an item perspective. On the person side, it was found that a single cognitive dimension spanned the performance differences between the two ability groups. However, a discontinuity, or "jump," was observed in the performance of the normal ability group on the easier items. On the item side, the addition problems were almost perfectly ordered in difficulty according to their problem size. Differences in difficulty were explained by factors related to the difficulty of executing nonretrieval strategies. All findings were interpreted within the framework of Siegler's (e.g., R. S. Siegler & C. Shipley, 1995) model of children's strategy choices in arithmetic. Models from item response theory were used to test the hypotheses.

9.
This study examines item bias on Forms L and M of the Peabody Picture Vocabulary Test–Revised (PPVT-R) for a sample of Anglo-American and Mexican-American children. Analyses of variance (ANOVA) were employed to assess item bias as defined by items × ethnicity interactions. Follow-up analyses were performed using a Bonferroni-type procedure on individual item contrasts. Bias as measured by differences in item difficulty was found in both groups; however, there was no clear pattern of items that were more difficult for either group. The small number of items that were more difficult for one ethnic group than for the other, coupled with the high reliability of performance overall for both groups, suggests that bias in the content of the PPVT-R is minimal.

10.
Probabilistic reasoning skills are important in various contexts. The aim of the present study was to develop a new instrument (the Probabilistic Reasoning Scale – PRS) to accurately measure low levels of probabilistic reasoning ability in order to identify people with difficulties in this domain. Item response theory was applied to construct the scale, and to investigate differential item functioning (i.e., whether the items were invariant) across genders, educational levels, and languages. Additionally, we tested the validity of the scale by investigating the relationships between the PRS and several other measures. The results revealed that the items had a low level of difficulty. Nonetheless, the discriminative measures showed that the items can discriminate between individuals with different trait levels, and the test information function showed that the scale accurately assesses low levels of probabilistic reasoning ability. Additionally, through investigating differential item functioning, the measurement equivalence of the scale at the item level was confirmed for gender, educational status, and language (i.e., Italian and English). Concerning validity, the results showed the expected correlations with numerical skills, math‐related attitudes, statistics achievement, IQ, reasoning skills, and risky choices both in the Italian and British samples. In conclusion, the PRS is an ideal instrument for identifying individuals who struggle with basic probabilistic reasoning, and who could be targeted by specific interventions. Copyright © 2017 John Wiley & Sons, Ltd.

11.
12.
In a recent empirical study, Starns, Hicks, Brown, and Martin (Memory & Cognition, 36, 1–8, 2008) collected source judgments for old items that participants had claimed to be new and found residual source discriminability depending on the old-new response bias. The authors interpreted their finding as evidence in favor of the bivariate signal-detection model, but against the two-high-threshold model of item/source memory. According to the latter, new responses only follow from the state of old-new uncertainty for which no source discrimination is possible, and the probability of entering this state is independent of the old-new response bias. However, when missed old items were presented for source discrimination, the participants could infer that the items had been previously studied. To test whether this implicit feedback led to second retrieval attempts and thus to source memory for presumably unrecognized items, we replicated Starns et al.’s (Memory & Cognition, 36, 1–8, 2008) finding and compared their procedure to a procedure without such feedback. Our results challenge the conclusion to abandon discrete processing in source memory; source memory for unrecognized items is probably an artifact of the procedure, by which implicit feedback prompts participants to reconsider their recognition judgment when asked to rate the source of old items in the absence of item memory.

13.
Metamemory judgements and reality monitoring judgements were compared for real and imagined stimuli. Line drawings of everyday items were either perceived or imagined in differing ratios, to (a) investigate people's ability to predict the class of item that would be better recalled (Judgements of Learning, JOL), and the class of item which would be better sourced (Judgements of Source, JOS) in a future recall test, and (b) test the hypothesis that participants would show a bias towards calling remembered items real when the source had been forgotten. Although participants' JOLs indicated that they believed real items would be more memorable than imagined, in both experiments a larger proportion of items from either class (real or imagined) was only recalled when presentation modality was less frequent for that class. By contrast, JOSs were no different for real or imagined items, even though source attribution was more accurate for real than imagined items. An attribution of memories to real rather than to imagined events that often occurs when participants are unsure about the source (labelled a ‘bias towards the real’) was due to phenomenological qualities of the memories. The results are discussed in terms of Johnson and Raye's (1981) reality‐monitoring model. Copyright © 2002 John Wiley & Sons, Ltd.

14.
This paper describes several simulation studies that examine the effects of capitalization on chance in item selection and ability estimation in CAT, employing the 3-parameter logistic model. In order to generate different estimation errors for the item parameters, the calibration sample size was manipulated (N = 500, 1000, and 2000 subjects), as was the ratio of item bank size to test length (banks of 197 and 788 items, test lengths of 20 and 40 items), both in a CAT and in a random test. Results show that capitalization on chance is particularly serious in CAT, as revealed by the large positive bias found in the small-sample calibration conditions. For broad ranges of theta, the overestimation of precision (asymptotic SE) reaches levels of 40%, something that does not occur with the RMSE of theta. The problem grows as the ratio of item bank size to test length increases. Potential solutions were tested in a second study, where two exposure control methods were incorporated into the item selection algorithm. Some alternative solutions are discussed.
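A hedged sketch of the mechanism being simulated: when items are selected by maximum estimated information, items whose parameters happen to be overestimated are preferentially chosen, so the standard error implied by the estimated parameters understates the true error. The code below illustrates that logic only and is not a reproduction of the authors' design (it uses a 2PL information function for brevity and made-up calibration error):

    import numpy as np

    rng = np.random.default_rng(1)
    n_items = 200
    a_true = rng.lognormal(0.0, 0.3, n_items)           # true discriminations
    a_hat = a_true + rng.normal(0, 0.25, n_items)       # estimates with calibration error
    b = rng.normal(0, 1, n_items)

    def info(a, b, theta):                               # 2PL item information
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        return a ** 2 * p * (1 - p)

    theta = 0.0
    pick = np.argsort(info(a_hat, b, theta))[-20:]       # CAT-style: 20 most informative items by ESTIMATED info
    se_claimed = 1 / np.sqrt(info(a_hat[pick], b[pick], theta).sum())
    se_actual = 1 / np.sqrt(info(a_true[pick], b[pick], theta).sum())
    print(se_claimed, se_actual)   # the claimed SE is typically smaller: precision is overstated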

15.
While the Angoff (1971) procedure is a commonly used cut score method, critics (Berk, 1996; Impara & Plake, 1997) argue that it places too‐high cognitive demands on raters. In response to these criticisms, a number of modifications to the method have been proposed. Suggested Angoff modifications include using an iterative rating process, presenting judges with normative data about item performance, revising the rating judgment into a Yes/No decision, assigning relative weights to dimensions within a test, and using item response theory in setting cut scores. In this study, subject matter expert raters were provided with a ‘difficulty anchored’ rating scale to use while making Angoff ratings; this scale can be viewed as a variation of the Angoff normative data modification. The rating scale presented test items having known p‐values as anchors, and served as a simple means of providing normative information to guide the Angoff rating process. Results are discussed regarding the reliability of the mean Angoff rating (.73) and the correlation of mean Angoff ratings with item difficulty (observed r ranges from .65 to .73).
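For readers unfamiliar with the arithmetic, the standard Angoff computation averages each item's ratings (the judged probability that a minimally competent examinee answers correctly) across raters and sums the item means to obtain the raw cut score. The ratings below are hypothetical and only illustrate the calculation:

    import numpy as np

    # Rows = raters, columns = items; each entry is the judged probability that a
    # minimally competent examinee answers the item correctly.
    ratings = np.array([
        [0.6, 0.4, 0.8, 0.5],
        [0.7, 0.5, 0.7, 0.5],
        [0.5, 0.4, 0.9, 0.6],
    ])
    item_means = ratings.mean(axis=0)   # mean Angoff rating per item
    cut_score = item_means.sum()        # recommended raw cut score for this 4-item test
    print(item_means, cut_score)

The reliability and item-difficulty correlations reported in the abstract would then be computed on these mean item ratings against observed p-values.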

16.
When developing and evaluating psychometric measures, a key concern is to ensure that they accurately capture individual differences on the intended construct across the entire population of interest. Inaccurate assessments of individual differences can occur when responses to some items reflect not only the intended construct but also construct-irrelevant characteristics, like a person's race or sex. Unaccounted for, this item bias can lead to apparent differences in scores that do not reflect true differences, invalidating comparisons between people with different backgrounds. Accordingly, empirically identifying which items manifest bias through the evaluation of differential item functioning (DIF) has been a longstanding focus of much psychometric research. The majority of this work has focused on evaluating DIF across two (or a few) groups. Modern conceptualizations of identity, however, emphasize its multi-determined and intersectional nature, with some aspects better represented as dimensional than categorical. Fortunately, many model-based approaches to modelling DIF now exist that allow for simultaneous evaluation of multiple background variables, including both continuous and categorical variables, and potential interactions among background variables. This paper provides a comparative, integrative review of these new approaches to modelling DIF and clarifies both the opportunities and challenges associated with their application in psychometric research.
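One widely used model-based route to this kind of multi-variable DIF analysis is logistic regression with the matching score, several background variables (categorical or continuous), and their interactions entered jointly. The sketch below uses synthetic data and the statsmodels formula interface; it illustrates the general approach rather than any specific method reviewed in the paper:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 500
    total = rng.normal(0, 1, n)                        # matching variable (standardized rest score)
    age = rng.normal(40, 10, n)                        # continuous background variable
    sex = rng.choice(["f", "m"], n)                    # categorical background variable
    eta = -0.2 + 1.2 * total + 0.4 * (sex == "f")      # the sex term injects uniform DIF
    resp = rng.binomial(1, 1 / (1 + np.exp(-eta)))
    df = pd.DataFrame(dict(resp=resp, total=total, age=age, sex=sex))

    # Background main effects capture uniform DIF; interactions with the matching
    # score capture nonuniform DIF, for several variables simultaneously.
    model = smf.logit("resp ~ total + age + C(sex) + total:age + total:C(sex)", data=df).fit()
    print(model.summary())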

17.
A general linear latent trait model for continuous item responses is described. The special unidimensional case for continuous item responses is Jöreskog's (1971) model of congeneric item responses. In the context of the unidimensional case model for continuous item responses the concepts of item and test information functions, specific objectivity, item bias, and reliability are discussed; also the application of the model to test construction is shown. Finally, the correspondence with latent trait theory for dichotomous item responses is discussed.
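A plausible rendering of the congeneric case, assuming normally distributed errors (the notation here is ours, not necessarily the article's), is

$$X_{ij} = \mu_i + \lambda_i \theta_j + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma_i^2),$$

$$I_i(\theta) = \frac{\lambda_i^2}{\sigma_i^2}, \qquad I(\theta) = \sum_i \frac{\lambda_i^2}{\sigma_i^2},$$

so that, unlike in the dichotomous case, each item's information is constant over the trait continuum, and item bias can be examined as group differences in the item intercepts $\mu_i$ or loadings $\lambda_i$.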

18.
A model-based modification (SIBTEST) of the standardization index based upon a multidimensional IRT bias modeling approach is presented that detects and estimates DIF or item bias simultaneously for several items. A distinction between DIF and bias is proposed. SIBTEST detects bias/DIF without the usual Type 1 error inflation due to group target ability differences. In simulations, SIBTEST performs comparably to Mantel-Haenszel for the one item case. SIBTEST investigates bias/DIF for several items at the test score level (multiple item DIF called differential test functioning: DTF), thereby allowing the study of test bias/DIF, in particular bias/DIF amplification or cancellation and the cognitive bases for bias/DIF. This research was partially supported by Office of Naval Research Cognitive and Neural Sciences Grant N0014-90-J-1940, 4421-548 and National Science Foundation Mathematics Grant NSF-DMS-91-01436. The research reported here is collaborative in every respect and the order of authorship is alphabetical. The assistance of Hsin-hung Li and Louis Roussos in conducting the simulation studies was of great help. Discussions with Terry Ackerman, Paul Holland, and Louis Roussos were very helpful.
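The SIBTEST statistic itself is not given in the abstract; a commonly cited form of the unidirectional index, included here as a hedged reminder rather than a quotation of the article, is

$$\hat{\beta}_{\mathrm{UNI}} \;=\; \sum_{k} \hat{p}_k \left( \bar{Y}^{*}_{Rk} - \bar{Y}^{*}_{Fk} \right), \qquad B \;=\; \frac{\hat{\beta}_{\mathrm{UNI}}}{\hat{\sigma}\!\left(\hat{\beta}_{\mathrm{UNI}}\right)},$$

where $\hat{p}_k$ is the proportion of focal-group examinees at matching-subtest score level $k$, $\bar{Y}^{*}_{Rk}$ and $\bar{Y}^{*}_{Fk}$ are the regression-corrected mean scores on the studied item(s) for the reference and focal groups, and $B$ is referred to an approximate standard normal distribution.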

19.
A method is proposed for the detection of item bias with respect to observed or unobserved subgroups. The method uses quasi-loglinear models for the incomplete subgroup × test score × item 1 × ... × item k contingency table. If subgroup membership is unknown the models are Haberman's incomplete-latent-class models. The (conditional) Rasch model is formulated as a quasi-loglinear model. The parameters in this loglinear model, that correspond to the main effects of the item responses, are the conditional estimates of the parameters in the Rasch model. Item bias can then be tested by comparing the quasi-loglinear-Rasch model with models that contain parameters for the interaction of item responses and the subgroups. The author thanks Wim J. van der Linden and Gideon J. Mellenbergh for comments and suggestions and Frank Kok for empirical data.
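Schematically, and hedged, since the abstract does not reproduce the exact parameterization: the Rasch item parameters appear as main-effect terms for the item responses in a quasi-loglinear model for the subgroup × score × response-pattern table, and item bias is tested by adding item-by-subgroup interaction terms,

$$\log m_{g,t,\mathbf{x}} \;=\; \lambda \;+\; \lambda^{G}_{g} \;+\; \lambda^{T}_{t} \;+\; \sum_{i} x_i \beta_i \qquad \text{(no-bias model)},$$

$$\log m_{g,t,\mathbf{x}} \;=\; \lambda \;+\; \lambda^{G}_{g} \;+\; \lambda^{T}_{t} \;+\; \sum_{i} x_i \beta_i \;+\; \sum_{i} x_i \gamma_{ig} \qquad \text{(bias model)},$$

where $g$ indexes the (possibly latent) subgroup, $t$ the total score, and $\mathbf{x} = (x_1, \ldots, x_k)$ the response pattern; comparing the fit of the two models tests whether any $\gamma_{ig}$ differs from zero.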

20.
Intellectual ability is assessed with the Spot-the-Word (STW) test (A. Baddeley, H. Emslie, & I. Nimmo Smith, 1993) by asking respondents to identify a word in a word-nonword item pair. Results in moderate-sized samples suggest this ability is resistant to decline due to dementia. The authors used a 3-parameter item response theory model to investigate the measurement properties of the STW in a large community-dwelling sample (n=2,480) 60 to 64 years of age. A number of poorly performing items were identified. Substantial guessing was present; however, the number of words correctly identified was found to be an accurate index of ability. Performance was moderately related to a number of tests of cognitive performance and was effectively unrelated to visual acuity and to physical or mental health status. The STW is a promising test of ability that, in the future, may be refined by the deletion or replacement of poorly functioning items.
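As a reminder of the model used (the specific parameter estimates are not reported in the abstract), the three-parameter logistic item response function is

$$P(X_i = 1 \mid \theta) \;=\; c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}},$$

where the lower asymptote $c_i$ absorbs guessing; for a two-alternative word/nonword judgment one would expect $c_i$ to sit near .5, which is consistent with the substantial guessing the authors report.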

