
IRT-based scoring methods for multidimensional forced-choice tests
Cite this article: LIU Juan, ZHENG Chanjin, LI Yunchuan, LIAN Xu. IRT-based scoring methods for multidimensional forced-choice tests[J]. Advances in Psychological Science, 2022, 30(6): 1410-1428. DOI: 10.3724/SP.J.1042.2022.01410
Authors: LIU Juan, ZHENG Chanjin, LI Yunchuan, LIAN Xu
Affiliations: 1. Beijing Insight Online Management Consulting Co., Ltd., Beijing 100102, China; 2. Department of Educational Psychology, East China Normal University, Shanghai 200062, China; 3. Shanghai Institute of Artificial Intelligence for Education, East China Normal University, Shanghai 200062, China
Abstract: Forced-choice (FC) tests are widely used in non-cognitive measurement because they control the response biases associated with the traditional Likert format. However, the traditional scoring of FC tests produces ipsative data, which have long been criticized as unsuitable for inter-individual comparison. In recent years, the development of several FC IRT models that recover near-normative information from FC tests has renewed the interest of researchers and practitioners in FC IRT models. This review first classifies and introduces six mainstream FC IRT models according to the decision model and the item response model each adopts; it then compares and summarizes the models in terms of model construction and parameter estimation methods; next, it reviews three strands of applied research: parameter invariance testing, computerized adaptive testing (CAT), and validity studies; finally, it suggests that future research go deeper in four directions: model extension, parameter invariance testing, FC CAT, and validity research.
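For concreteness, the two families of decision models can be written as follows. This is simplified, illustrative notation assembled for this page rather than a reproduction of any single paper: mu_i and psi_i^2 denote the mean and uniqueness of statement i's latent utility at a given trait level, and P_i(1) is the probability of endorsing statement i alone (for unfolding items, e.g., from the GGUM).

    % Thurstonian decision model (TIRT, RIM, BRB-IRT):
    % preference follows the difference of two latent utilities.
    P(i \succ k) = \Phi\!\left( \frac{\mu_i - \mu_k}{\sqrt{\psi_i^2 + \psi_k^2}} \right)

    % Luce choice axiom (MUPP framework and its derivatives):
    % the pairwise preference is built from single-statement probabilities.
    P(i \succ k) = \frac{P_i(1)\,[1 - P_k(1)]}{P_i(1)\,[1 - P_k(1)] + [1 - P_i(1)]\,P_k(1)}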

Keywords: forced-choice test; ipsative data; TIRT; MUPP; GGUM-RANK
Received: 2021-07-06

IRT-based scoring methods for multidimensional forced-choice tests
LIU Juan, ZHENG Chanjin, LI Yunchuan, LIAN Xu. IRT-based scoring methods for multidimensional forced-choice tests[J]. Advances in Psychological Science, 2022, 30(6): 1410-1428. DOI: 10.3724/SP.J.1042.2022.01410
Authors: LIU Juan, ZHENG Chanjin, LI Yunchuan, LIAN Xu
Affiliation: 1. Beijing Insight Online Management Consulting Co., Ltd., Beijing 100102, China; 2. Department of Educational Psychology, East China Normal University, Shanghai 200062, China; 3. Shanghai Institute of Artificial Intelligence for Education, East China Normal University, Shanghai 200062, China
Abstract: The forced-choice (FC) format is widely used in non-cognitive testing because of its effectiveness in resisting faking and the response biases associated with the traditional Likert format. The traditional scoring of FC tests, however, produces ipsative data, which have long been criticized as unsuitable for inter-individual comparisons. In recent years, the development of multiple FC IRT models that allow researchers to obtain normative information from FC tests has re-ignited the interest of researchers and practitioners in FC IRT models.

The six prevailing FC IRT models in the literature can be classified by the decision model and the item response model they adopt. The TIRT, RIM, and BRB-IRT models take Thurstone's law of comparative judgment as the decision model, whereas the MUPP framework and its derivatives adopt the Luce choice axiom. With respect to the item response model, MUPP-GGUM and GGUM-RANK are suited to items with an unfolding response process, while the other FC models assume a dominance response process.

The models can also be distinguished by their parameter estimation methods, both the algorithm used and the estimation process. MUPP-GGUM uses a two-step strategy: item parameters are calibrated in advance with Likert-format data, which facilitates subsequent item bank management. The other models estimate all parameters jointly. Among the joint approaches, TIRT uses the traditional weighted least squares (WLS) and diagonally weighted least squares (DWLS) algorithms, which run conveniently in Mplus and take relatively little time but suffer from poor convergence and high memory usage in high-dimensional settings. The other models use the Markov chain Monte Carlo (MCMC) algorithm, which effectively resolves the convergence and memory problems of the traditional algorithms, although estimation is much slower.

Applied research on FC IRT models can be summarized in three areas: parameter invariance testing, computerized adaptive testing (CAT), and validity studies. Parameter invariance testing divides into cross-block consistency and cross-population consistency (also known as differential item functioning, DIF). Current research focuses mostly on the latter; DIF-testing methods already exist for TIRT and RIM. Future research should enrich and upgrade these methods, and develop DIF tests for the other FC models, so that DIF from multiple sources can be detected with greater sensitivity.

Non-cognitive tests are usually high-dimensional, and the test-length problem caused by high dimensionality can be naturally addressed by CAT. Suitable item selection strategies have already been explored for the MUPP-GGUM, GGUM-RANK, and RIM models. Future research can continue to explore selection strategies for other FC IRT models so that FC CAT achieves a balance between measurement precision and test length in high-dimensional contexts; a minimal sketch of such a selection rule follows.
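The sketch below shows a maximum-information selection rule for pairwise FC blocks under a deliberately simplified Thurstonian response rule (unit loadings and unit uniquenesses). The item bank, the single-dimension information calculation, and all parameter values are hypothetical illustrations, not taken from any of the reviewed papers.

    import math

    def pair_prob(theta, item):
        """P(prefer statement a over statement b) under a simplified
        Thurstonian rule: Phi(((theta_a - mu_a) - (theta_b - mu_b)) / sqrt(2)),
        i.e., unit loadings and unit uniquenesses for both statements."""
        dim_a, mu_a, dim_b, mu_b = item
        z = (theta[dim_a] - mu_a) - (theta[dim_b] - mu_b)
        return 0.5 * (1.0 + math.erf(z / 2.0))  # Phi(z / sqrt(2))

    def info(theta, item):
        """Binary-item Fisher information p'(theta)^2 / (p * (1 - p)),
        taken along the first statement's dimension only (a simplification);
        the derivative is computed numerically for brevity."""
        eps = 1e-4
        hi, lo = list(theta), list(theta)
        hi[item[0]] += eps
        lo[item[0]] -= eps
        dp = (pair_prob(hi, item) - pair_prob(lo, item)) / (2.0 * eps)
        p = pair_prob(theta, item)
        return dp * dp / max(p * (1.0 - p), 1e-12)

    def select_next(theta, bank, administered):
        """Maximum-information rule: pick the unused block that is most
        informative at the current interim trait estimate."""
        candidates = [i for i in range(len(bank)) if i not in administered]
        return max(candidates, key=lambda i: info(theta, bank[i]))

    # Hypothetical 3-dimensional bank: (dim_a, mu_a, dim_b, mu_b) per block.
    bank = [(0, 0.5, 1, -0.2), (1, 1.0, 2, 0.0), (0, -1.0, 2, 0.3)]
    print(select_next([0.0, 0.0, 0.0], bank, administered={0}))

Published FC-CAT procedures use richer criteria (e.g., D-optimality on the full information matrix, exposure control); the point here is only the select-estimate-repeat loop that lets CAT trade test length against precision.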
Validity studies focus on whether the scores obtained from FC IRT models reflect individuals' true characteristics, since tests that are not validated carry serious pitfalls in the interpretation of results. Some studies have compared IRT scores, traditional scores, and Likert-format scores to examine whether IRT scores yield results similar to Likert scores, and whether they outperform traditional scores in the recovery of latent traits. However, using Likert-format scores as the criterion may itself introduce response bias as a source of error; future research can focus on obtaining purer, more convincing criteria.
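In simulation studies of this kind, trait recovery is typically quantified by correlating the generating traits with the model-based estimates, along the following lines (a generic sketch; theta_hat is assumed to come from whichever FC IRT model is being evaluated):

    import numpy as np

    def recovery_metrics(theta_true, theta_hat):
        """Per-dimension Pearson correlation and RMSE between the
        generating ('true') traits and the estimated traits;
        rows are persons, columns are trait dimensions."""
        theta_true = np.asarray(theta_true, dtype=float)
        theta_hat = np.asarray(theta_hat, dtype=float)
        metrics = []
        for d in range(theta_true.shape[1]):
            r = np.corrcoef(theta_true[:, d], theta_hat[:, d])[0, 1]
            rmse = float(np.sqrt(np.mean((theta_true[:, d] - theta_hat[:, d]) ** 2)))
            metrics.append((r, rmse))
        return metrics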
Keywords: forced-choice test; ipsative data; TIRT; MUPP; GGUM-RANK
