Similar Documents
20 similar documents retrieved (search time: 15 ms)
1.
This report documents relationships between differential item functioning (DIF) identification and (1) item–trait association and (2) scale multidimensionality in personality assessment. Applying the logistic regression model of Zumbo (1999) [A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense], DIF effect size is found to become increasingly inflated as the association between the investigated items and trait scores decreases. Similar patterns were noted for the influence of scale multidimensionality on DIF identification. Individuals who investigate DIF in personality assessment applications are provided with estimates of the impact of the magnitude of item–trait association and of scale multidimensionality on DIF occurrence and effect size. The results emphasize the importance of excluding investigated items from focal trait identification prior to conducting DIF analyses, and of reporting item and scale psychometric properties in DIF reports.
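Zumbo's logistic regression approach tests an item for DIF by comparing three nested models: conditioning on total score only, adding the group main effect (uniform DIF), and adding the group-by-score interaction (nonuniform DIF). A minimal sketch of the likelihood-ratio arithmetic, using invented log-likelihood values (all numbers hypothetical), might look like this:

```python
from math import exp, sqrt
from statistics import NormalDist

def chi2_sf(x, df):
    """Survival function of the chi-square distribution for df in {1, 2}."""
    if df == 2:
        return exp(-x / 2)  # exact closed form for 2 degrees of freedom
    if df == 1:
        return 2 * (1 - NormalDist().cdf(sqrt(x)))  # via the standard normal
    raise ValueError("only df = 1 or 2 supported here")

# Hypothetical maximised log-likelihoods of the three nested models:
ll_base    = -612.4   # item ~ total score
ll_uniform = -606.1   # item ~ total score + group
ll_full    = -605.8   # item ~ total score + group + group:score

# 2-df omnibus likelihood-ratio test for any DIF
chi2_total = 2 * (ll_full - ll_base)
p_total = chi2_sf(chi2_total, df=2)

# 1-df decomposition into uniform and nonuniform DIF
chi2_uniform    = 2 * (ll_uniform - ll_base)
chi2_nonuniform = 2 * (ll_full - ll_uniform)
p_uniform    = chi2_sf(chi2_uniform, df=1)
p_nonuniform = chi2_sf(chi2_nonuniform, df=1)

print(chi2_total, p_total)       # about 13.2 and 0.0014: DIF is present
print(p_uniform, p_nonuniform)   # uniform DIF significant, nonuniform not
```

Zumbo additionally pairs the significance test with an R²-difference effect size, which this report relates to item–trait association; that step requires the fitted models themselves and is omitted from this sketch.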

2.
Wang Zhuoran, Guo Lei & Bian Yufang. Acta Psychologica Sinica, 2014, 46(12): 1923-1932
Detecting differential item functioning (DIF) is an important issue in cognitive diagnostic assessment. This study first introduced the logistic regression (LR) method to DIF detection in cognitive diagnostic tests, and then compared its performance with the Mantel-Haenszel (MH) method and the Wald test. The comparison also examined the effects of the matching variable, DIF type, DIF size, and sample size. Results showed that: (1) the LR method achieved high power and a low Type I error rate when detecting DIF in cognitive diagnostic tests; (2) the performance of the LR method was unaffected by the choice of cognitive diagnostic method; (3) the LR method effectively distinguished uniform from nonuniform DIF, again with high power and a low Type I error rate; (4) using the knowledge state as the matching variable yielded satisfactory power and Type I error rates; and (5) power increased with larger DIF and larger samples, whereas the Type I error rate was unaffected.

3.
We investigated measurement equivalence in two antisocial behavior scales (i.e., one scale for adolescents and a second scale for young adults) by examining differential item functioning (DIF) for respondents from single-parent (n = 109) and two-parent families (n = 447). Even though one item in the scale for adolescents and two items in the scale for young adults showed significant DIF, the two scales exhibited non-significant differential test functioning (DTF). Both uniform and nonuniform DIF were investigated and examples of each type were identified. Specifically, uniform DIF was exhibited in the adolescent scale whereas nonuniform DIF was shown in the young adult scale. Implications of DIF results for assessment of antisocial behavior, along with strengths and limitations of the study, are discussed.

4.
Usually, methods for detection of differential item functioning (DIF) compare the functioning of items across manifest groups. However, the manifest groups with respect to which the items function differentially may not necessarily coincide with the true source of the bias. It is expected that DIF detection under a model that includes a latent DIF variable is more sensitive to this source of bias. In a simulation study, it is shown that a mixture item response theory model, which includes a latent grouping variable, performs better in identifying DIF items than DIF detection methods using manifest variables only. The difference between manifest and latent DIF detection increases as the correlation between the manifest variable and the true source of the DIF becomes smaller. Different sample sizes, relative group sizes, and significance levels are studied. Finally, an empirical example demonstrates the detection of heterogeneity in a minority sample using a latent grouping variable. Manifest and latent DIF detection methods are applied to a Vocabulary test of the General Aptitude Test Battery (GATB).

5.
Differential item functioning (DIF) is a pernicious statistical issue that can mask true group differences on a target latent construct. A considerable amount of research has focused on evaluating methods for testing DIF, such as likelihood ratio tests in item response theory (IRT). Most of this research has focused on the asymptotic properties of DIF testing, in part because many latent variable methods require large samples to obtain stable parameter estimates. Much less research has evaluated these methods in small samples, despite the fact that many social and behavioral scientists frequently encounter small samples in practice. In this article, we examine the extent to which model complexity, the number of model parameters estimated simultaneously, affects the recovery of DIF in small samples. We compare three models that vary in complexity: logistic regression with sum scores, the 1-parameter logistic IRT model, and the 2-parameter logistic IRT model. We expected that logistic regression with sum scores and the 1-parameter logistic IRT model would more accurately estimate DIF because these models yield more stable estimates despite being misspecified. Indeed, a simulation study and an empirical example of adolescent substance use show that, even when data are generated from, and assumed to follow, a 2-parameter logistic IRT model, using parsimonious models in small samples leads to more powerful tests of DIF while adequately controlling the Type I error rate. We also provide evidence for the minimum sample sizes needed to detect DIF, and we evaluate whether applying corrections for multiple testing is advisable. Finally, we provide recommendations for applied researchers who conduct DIF analyses in small samples.

6.
Recent detection methods for differential item functioning (DIF) include approaches such as Rasch Trees, DIFlasso, GPCMlasso and Item Focussed Trees, all of which, in contrast to well-established methods, can handle metric covariates inducing DIF. A new estimation method addresses their downsides by combining three central virtues: the use of the conditional likelihood for estimation, the incorporation of linear influence of metric covariates on item difficulty, and the possibility of detecting different DIF types: certain items showing DIF, certain covariates inducing DIF, or certain covariates inducing DIF in certain items. Each of the approaches mentioned lacks two of these aspects. We introduce a method for DIF detection which, first, utilizes the conditional likelihood for estimation combined with group-lasso penalization for item or variable selection and L1-penalization for interaction selection; second, incorporates linear effects instead of approximation through step functions; and third, provides the possibility to investigate any of the three DIF types. The method is described theoretically, and challenges in implementation are discussed. A dataset is analysed for all DIF types and shows comparable results between methods. Simulation studies per DIF type reveal competitive performance of cmlDIFlasso, particularly when selecting interactions in the case of large sample sizes and numbers of parameters. Coupled with low computation times, cmlDIFlasso seems a worthwhile option for applied DIF detection.

7.
This study investigated gender-based differential item functioning (DIF) in science literacy items included in the Program for International Student Assessment (PISA) 2012. Prior research has suggested the presence of such DIF in large-scale surveys. Our study extends the empirical literature by examining gender-based DIF at the country level in order to gain a better overall picture of how cultural and national differences affect the occurrence of uniform and nonuniform DIF. Our statistical results indicate widespread gender-based DIF in PISA, with estimates of the percentage of potentially biased items ranging between 2 and 44% (M = 16, SD = 9.9). Our reliance on nationally representative country samples allows these findings to have wide applicability.

8.
As research continues to document differences in the prevalence of mental health problems such as depression across racial/ethnic groups, the issue of measurement equivalence becomes increasingly important to address. The Mood and Feelings Questionnaire (MFQ) is a widely used screening tool for child and adolescent depression. This study applied a differential item functioning (DIF) framework to data from a sample of 6th and 8th grade students in the Seattle Public School District (N = 3,593) to investigate the measurement equivalence of the MFQ. Several items in the MFQ were found to have DIF, but this DIF was associated with negligible individual- or group-level impact. These results suggest that differences in MFQ scores across groups are unlikely to be caused by measurement non-equivalence.

9.
Methods for the identification of differential item functioning (DIF) in Rasch models are typically restricted to the case of two subgroups. A boosting algorithm is proposed that is able to handle the more general setting where DIF can be induced by several covariates at the same time. The covariates can be both continuous and (multi‐)categorical, and interactions between covariates can also be considered. The method works for a general parametric model for DIF in Rasch models. Since the boosting algorithm selects variables automatically, it is able to detect the items which induce DIF. It is demonstrated that boosting competes well with traditional methods in the case of subgroups. The method is illustrated by an extensive simulation study and an application to real data.

10.
Most methods for detecting differential item functioning (DIF) are suitable when the sample sizes are sufficiently large to validate the null statistical distributions. There is no guarantee, however, that they will still perform adequately when there are few respondents in the focal group or in both the reference and the focal group. Angoff's delta plot is a potentially useful alternative for small-sample DIF investigation, but it suffers from an improper DIF flagging criterion. The purpose of this paper is to improve this classification rule under mild statistical assumptions. This improvement yields a modified delta plot with an adjusted DIF flagging criterion for small samples. A simulation study was conducted to compare the modified delta plot with both the classical delta plot approach and the Mantel-Haenszel method. It is concluded that the modified delta plot is consistently less conservative and more powerful than the usual delta plot, and is also less conservative and more powerful than the Mantel-Haenszel method as long as at least one group of respondents is small.
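Angoff's delta plot transforms each item's proportion correct in each group to the delta metric, Δ = 4·Φ⁻¹(1 − p) + 13, fits the major (principal) axis through the item points, and flags items far from that axis. A minimal sketch with invented proportions, using the classical fixed threshold |d| > 1.5 that the paper argues is improper (the modified rule instead derives the threshold from the estimated variability of the distances):

```python
from math import sqrt
from statistics import NormalDist, fmean

def delta(p):
    """Transform a proportion correct to the ETS delta metric."""
    return 4 * NormalDist().inv_cdf(1 - p) + 13

# Invented per-item proportions correct; item 3 is much harder for the focal group
p_ref = [0.90, 0.80, 0.70, 0.60, 0.50]
p_foc = [0.88, 0.78, 0.30, 0.58, 0.48]

x = [delta(p) for p in p_ref]   # reference-group deltas
y = [delta(p) for p in p_foc]   # focal-group deltas

# Major (principal) axis y = a + b*x of the item cloud
mx, my = fmean(x), fmean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
b = (syy - sxx + sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
a = my - b * mx

# Perpendicular distance of each item point from the axis
d = [abs(b * xi - yi + a) / sqrt(b ** 2 + 1) for xi, yi in zip(x, y)]

flagged = [i for i, di in enumerate(d) if di > 1.5]   # classical criterion
print(flagged)   # only item index 2 is flagged
```

Note how the outlying item pulls the fitted axis toward itself, shrinking its own distance; this sensitivity in small item sets is part of the motivation for adjusting the flagging criterion.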

11.
This study used an ideal point response model to examine the extent to which applicants and incumbents differ when responding to personality items. It was hypothesized that applicants' responses would exhibit less folding at high trait levels than incumbents' responses. We used sample data from applicants (N = 1,509) and incumbents (N = 1,568) who completed the 16 Personality Questionnaire Select. Differential item functioning (DIF) and differential test functioning (DTF) analyses were conducted using the generalized graded unfolding model, which is based on ideal point model assumptions. Of the 90 items, 50 showed DIF; however, only 11 were in the hypothesized direction. DTF was significant for 3 of the 12 scales; 2 were in the hypothesized direction.

12.
The paper addresses three neglected questions from item response theory (IRT). In section 1, the properties of the "measurement" of ability or trait parameters and item difficulty parameters in the Rasch model are discussed. It is shown that the solution to this problem is rather complex and depends both on general assumptions about properties of the item response functions and on assumptions about the available item universe. Section 2 deals with the measurement of individual change or "modifiability" based on a Rasch test. A conditional likelihood approach is presented that (a) yields an ML estimator of modifiability for given item parameters, (b) allows one to test hypotheses about change by means of a Clopper-Pearson confidence interval for the modifiability parameter, and (c) allows one to estimate modifiability jointly with the item parameters. Uniqueness results for all three methods are also presented. In section 3, the Mantel-Haenszel method for detecting DIF is discussed under a novel perspective: what is the most general framework within which the Mantel-Haenszel method correctly detects DIF of a studied item? The answer is that this is a 2PL model in which, however, all discrimination parameters are known and the studied item has the same discrimination in both populations. Since these requirements would hardly be satisfied in practical applications, the case of constant discrimination parameters, that is, the Rasch model, is the only realistic framework. A simple Pearson χ² test for DIF of one studied item is proposed as an alternative to the Mantel-Haenszel test; moreover, this test is generalized to the case of two items simultaneously studied for DIF.
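For reference, the Mantel-Haenszel DIF test discussed in section 3 aggregates 2×2 (group × correct) tables across matched score strata. A minimal sketch with invented counts (three strata of 100 examinees each), computing the common odds ratio, its ETS delta rescaling, and the continuity-corrected chi-square directly:

```python
from math import log, sqrt
from statistics import NormalDist

# Invented stratified 2x2 tables (A, B, C, D) per score stratum, where
# A/B = reference-group correct/incorrect and C/D = focal-group correct/incorrect.
tables = [(40, 10, 30, 20),
          (30, 20, 20, 30),
          (20, 30, 10, 40)]

num = den = sum_a = sum_e = sum_v = 0.0
for a_k, b_k, c_k, d_k in tables:
    n = a_k + b_k + c_k + d_k
    num += a_k * d_k / n
    den += b_k * c_k / n
    n_ref, n_foc = a_k + b_k, c_k + d_k
    m_correct, m_wrong = a_k + c_k, b_k + d_k
    sum_a += a_k
    sum_e += n_ref * m_correct / n                                  # E(A_k) under H0
    sum_v += n_ref * n_foc * m_correct * m_wrong / (n ** 2 * (n - 1))  # Var(A_k)

alpha_mh = num / den                                  # MH common odds ratio
delta_mh = -2.35 * log(alpha_mh)                      # ETS delta scale
chi2_mh = (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v     # continuity-corrected statistic
p = 2 * (1 - NormalDist().cdf(sqrt(chi2_mh)))         # 1-df chi-square p-value

print(alpha_mh, delta_mh)   # 2.5 and about -2.15 (large DIF on the ETS scale)
print(chi2_mh, p)
```

With these made-up counts the reference group is consistently favored in every stratum, so the test flags the item decisively; real data are rarely this clean.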

13.
Recent work reframes direct effects of covariates on items in mixture models as differential item functioning (DIF) and shows that, when present in the data but omitted from the fitted latent class model, DIF can lead to overextraction of classes. However, less is known about the effects of DIF on model performance, including parameter bias, classification accuracy, and distortion of class-specific response profiles, once the correct number of classes is chosen. First, we replicate and extend prior findings relating DIF to class enumeration using a comprehensive simulation study. In a second simulation study using the same parameters, we show that, while the performance of latent class analysis (LCA) is robust to the misspecification of DIF effects, it is degraded when DIF is omitted entirely. Moreover, the robustness of LCA to omitted DIF differs widely based on the degree of class separation. Finally, the simulation results are contextualized by an empirical example.

14.
When developing and evaluating psychometric measures, a key concern is to ensure that they accurately capture individual differences on the intended construct across the entire population of interest. Inaccurate assessments of individual differences can occur when responses to some items reflect not only the intended construct but also construct-irrelevant characteristics, like a person's race or sex. Unaccounted for, this item bias can lead to apparent differences on the scores that do not reflect true differences, invalidating comparisons between people with different backgrounds. Accordingly, empirically identifying which items manifest bias through the evaluation of differential item functioning (DIF) has been a longstanding focus of much psychometric research. The majority of this work has focused on evaluating DIF across two (or a few) groups. Modern conceptualizations of identity, however, emphasize its multi-determined and intersectional nature, with some aspects better represented as dimensional than categorical. Fortunately, many model-based approaches to modelling DIF now exist that allow for simultaneous evaluation of multiple background variables, including both continuous and categorical variables, and potential interactions among background variables. This paper provides a comparative, integrative review of these new approaches to modelling DIF and clarifies both the opportunities and challenges associated with their application in psychometric research.

15.
This paper defines the concept of DIF for polytomously scored cognitive diagnostic tests and explores, through simulation experiments and an empirical study, the theoretical and practical applicability of four common polytomous DIF detection methods. The results showed that all four methods can effectively detect DIF in polytomous cognitive diagnosis, and that their performance is largely unaffected by the model used; using the knowledge state (KS) as the matching variable facilitates DIF detection more than using the total score; and the LDFA method and the Mantel test, each with KS as the matching variable, showed the highest power in detecting DIF items.

16.
Based on improved Wald statistics, this study extends two-group DIF detection methods to differential item functioning (DIF) testing across multiple groups. The improved Wald statistics are obtained by computing the observed information matrix (Obs) and the empirical cross-product information matrix (XPD), respectively. A simulation study compared these two variants with the traditional computation under multiple groups. The results showed that: (1) the Type I error rates of Obs and XPD were markedly lower than those of the traditional method, and under DINA model estimation the Type I error rates of Obs and XPD were close to the nominal level; (2) with larger sample sizes and larger DIF, Obs and XPD achieved statistical power roughly equal to that of the traditional Wald statistic.
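A hypothetical one-parameter sketch of the two-group Wald test that these improved statistics generalize (all numbers invented; in the multi-group, multi-parameter case, vectors of item-parameter estimates and a covariance matrix derived from the Obs or XPD information matrix replace these scalars):

```python
from math import sqrt
from statistics import NormalDist

# Invented item-difficulty estimates and standard errors for two groups
b_ref, se_ref = 0.40, 0.15
b_foc, se_foc = -0.10, 0.15

# Scalar Wald statistic; with vector parameters this becomes
# (b_ref - b_foc)' V^{-1} (b_ref - b_foc), with V from Obs or XPD.
wald = (b_ref - b_foc) ** 2 / (se_ref ** 2 + se_foc ** 2)
p = 2 * (1 - NormalDist().cdf(sqrt(wald)))   # 1-df chi-square p-value

print(wald, p)   # about 5.56; significant at the 0.05 level
```

How V is estimated is exactly what distinguishes the Obs and XPD variants from the traditional computation, which is why the Type I error behavior differs across the three.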

17.
Student well-being is a growing issue in higher education, and assessment of the prevalence of conditions such as loneliness is therefore important. In higher education and population surveys, the Three-Item Loneliness Scale (T-ILS) is used increasingly. The T-ILS is attractive for large multi-subject surveys, as it consists of only three items (derived from the UCLA Loneliness Scale). Several ways of classifying persons as lonely based on T-ILS scores exist: dichotomous and trichotomous classification schemes, and the use of sum scores with rising levels indicating more loneliness. The question remains whether T-ILS scores are comparable across the different population groups where they are used, or across groups of students in the higher education system. The aim was to investigate whether the T-ILS suffers from differential item functioning (DIF) that might change the loneliness classification among higher education students, using a large sample just admitted to 22 different academy profession degree programs in Denmark (N = 3,757). DIF was tested relative to degree program, age group and gender. The framework of graphical loglinear Rasch models was applied, as this allows for adjustment of sum scores for uniform DIF, and thus for assessment of whether DIF might change the classification. Two items showed DIF relative to degree program and gender, and adjusting for this DIF changed the classification for some subgroups. The consequences were negligible when using a dichotomous classification and larger when using a trichotomous classification. Therefore, trichotomous classification should be used with caution unless suitable adjustments for DIF are made prior to classification.

18.

Differential item functioning (DIF) statistics were computed using items from the Peabody Individual Achievement Test (PIAT)-Reading Comprehension subtest for children of the same age group (ages 7 through 12). The pattern of observed DIF items was determined by comparing each cohort across age groups. Differences related to race and gender were also identified within each cohort. Characteristics of DIF items were identified based on sentence length, vocabulary frequency, and sentence density. DIF items were more frequently associated with short sentences than with long sentences. This study explored a potential limitation in the longitudinal use of items in an adaptive test.

19.
This study investigated differential item functioning (DIF) mechanisms in the context of differential testlet effects across subgroups. Specifically, we investigated DIF manifestations when the stochastic ordering assumption on the nuisance dimension in a testlet does not hold. DIF hypotheses were formulated analytically using a parametric marginal item response function approach and compared with empirical DIF results from a unidimensional item response theory approach. The comparisons were made in terms of type of DIF (uniform or non‐uniform) and direction (whether the focal or reference group was advantaged). In general, the DIF hypotheses were supported by the empirical results, showing the usefulness of the parametric approach in explaining DIF mechanisms. Both analytical predictions of DIF and the empirical results provide insights into conditions where a particular type of DIF becomes dominant in a specific DIF direction, which is useful for the study of DIF causes.

20.
Differential item functioning (DIF) assessment is key in score validation. When DIF is present, scores may not accurately reflect the construct of interest for some groups of examinees, leading to incorrect conclusions from the scores. Given rising immigration and the increased reliance of educational policymakers on cross-national assessments such as the Programme for International Student Assessment (PISA), the Trends in International Mathematics and Science Study (TIMSS), and the Progress in International Reading Literacy Study (PIRLS), DIF with regard to native language is of particular interest in this context. However, given differences in language and culture, assuming similar cross-national DIF may lead to mistaken assumptions about the impact of immigration status and native language on test performance. The purpose of this study was to use model-based recursive partitioning (MBRP) to investigate uniform DIF in PIRLS items across European nations. Results demonstrated that DIF based on mother's language was present for several items on a PIRLS assessment, but that the patterns of DIF were not the same across all nations.


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号