1.

Detection of Test Speededness Using Change-Point Analysis

Can Shao Jun Li Ying Cheng 《Psychometrika》2016,81(4):1118-1141

Change-point analysis (CPA) is a well-established statistical method to detect abrupt changes, if any, in a sequence of data. In this paper, we propose a procedure based on CPA to detect test speededness. This procedure can not only classify examinees into speeded and non-speeded groups but also identify the point at which an examinee starts to speed. Identification of the change point can be very useful. First, it informs decision makers of the appropriate length of a test. Second, by removing the speeded responses, instead of the entire response sequence of an examinee suspected of speededness, ability estimation can be improved. Simulation studies show that this procedure is efficient in detecting both speeded examinees and the speeding point. Ability estimation is dramatically improved by removing speeded responses identified by our procedure. The procedure is then applied to a real dataset for illustration.
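
The idea of locating a speeding point can be illustrated with a minimal sketch. It is not the statistic used in the paper: it assumes each item's model-implied probability of a correct response is already known (the hypothetical `probs` argument) and scans every candidate change point for the largest standardized drop in mean residual. The function name and the `min_seg` parameter are illustrative assumptions.

```python
import math

def find_change_point(responses, probs, min_seg=3):
    """Scan candidate change points in a 0/1 response sequence.

    responses: observed item scores (0 or 1), in administration order.
    probs: assumed model-implied probabilities of a correct answer (hypothetical input).
    Returns (k, z): items 0..k-1 precede the suspected change, and z is the
    standardized drop in mean residual from the first to the second segment.
    """
    n = len(responses)
    resid = [u - p for u, p in zip(responses, probs)]
    var = [p * (1 - p) for p in probs]  # Bernoulli variance of each response
    best_k, best_z = None, float("-inf")
    for k in range(min_seg, n - min_seg + 1):
        mean_before = sum(resid[:k]) / k
        mean_after = sum(resid[k:]) / (n - k)
        # Standard error of the difference between the two segment means
        se = math.sqrt(sum(var[:k]) / k ** 2 + sum(var[k:]) / (n - k) ** 2)
        z = (mean_before - mean_after) / se
        if z > best_z:
            best_k, best_z = k, z
    return best_k, best_z
```

A large `z` suggests that performance fell after item `k`; in practice a critical value would have to be calibrated, for example by simulation, because the maximum is taken over many candidate points.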

2.

A Speeded Item Response Model: Leave the Harder till Later

Yu-Wei Chang Rung-Ching Tsai Nan-Jung Hsu 《Psychometrika》2014,79(2):255-274

A speeded item response model is proposed. We consider the situation where examinees may leave the harder items to a later period of a time-limited test. With such a strategy, examinees may not finish answering some of the harder items within the allocated time. In the proposed model, we try to describe such a mechanism by incorporating a speeded-effect term into the two-parameter logistic item response model. A Bayesian estimation procedure for the proposed model using Markov chain Monte Carlo is presented, and its performance relative to the two-parameter logistic item response model in a speeded test is demonstrated through simulations. The methodology is applied to physics examination data of the Department Required Test for college entrance in Taiwan for illustration.

3.

Computerized adaptive testing, anxiety levels, and gender differences

Barbara E. Fritts Jacob M. Marszalek 《Social Psychology of Education》2010,13(3):441-458

This study compares the amount of test anxiety experienced on a computerized adaptive test (CAT) to a paper-and-pencil test (P&P), as well as the state test anxiety experienced between males and females. Ninety-four middle school CAT examinees were compared to 65 middle school P&P examinees on their responses to the State-Trait Anxiety Inventory for Children (STAIC) after taking a standardized achievement test. Results of a multiple regression showed that P&P examinees had a higher mean STAIC score than CAT examinees after controlling for trait test anxiety and computer anxiety. Evidence of neither a main nor a moderator effect of gender was found. However, a subsequent path analysis gave evidence of an indirect effect of gender on STAIC score mediated by trait test anxiety. Results are discussed in the context of stereotype threat and the implications for the use of CAT in schools, given the digital divide between race and socioeconomic status. Recommendations for future research and practice are offered.

4.

Utilizing response times in cognitive diagnostic computerized adaptive testing under the higher-order deterministic input, noisy ‘and’ gate model

Hung-Yu Huang 《The British journal of mathematical and statistical psychology》2020,73(1):109-141

Methods of cognitive diagnostic computerized adaptive testing (CD-CAT) under higher-order cognitive diagnosis models have been developed to simultaneously provide estimates of the attribute mastery statuses of examinees for formative assessment and estimates of a latent continuous trait for overall summative evaluation. In a typical CD-CAT environment, examinees are often subject to a time limit, and the examinees’ response times (RTs) for specific test items can be routinely recorded by custom-made programs. Because examinees are individually administered tailored sets of test items from the item pool, they may experience different levels of speededness during testing and different levels of risk of running out of time. In this study, RTs were considered during the item-selection procedure to control the test speededness and the RTs were treated as useful information for improving latent trait estimation in CD-CAT under the higher-order deterministic input, noisy ‘and’ gate (DINA) model. A modified posterior-weighted Kullback–Leibler (PWKL) method that maximizes the item information per time unit and a shadow-test method that assembles a provisional test subject to a specified time constraint were developed. Two simulation studies were conducted to assess the effects of the proposed methods on the quality of CD-CAT for fixed- and variable-length exams. The results show that, compared with the traditional PWKL method, the proposed methods preserve a lower risk of running out of time while ensuring satisfactory attribute estimation and providing more accurate estimates of the latent trait and speed parameters. Finally, several suggestions for future research are proposed.
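
Selecting the item with the most information per time unit can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses the plain DINA model (no higher-order trait), takes each item's expected response time as a given number under the hypothetical key `"time"`, and omits the shadow-test method. All function names and the dictionary layout are assumptions.

```python
import math

def dina_prob(alpha, q, slip, guess):
    """DINA probability of a correct response for attribute pattern alpha."""
    has_all = all(a >= r for a, r in zip(alpha, q))  # masters every required attribute
    return 1 - slip if has_all else guess

def pwkl(item, alpha_hat, posterior):
    """Posterior-weighted KL divergence between the current estimate and every pattern."""
    p_hat = dina_prob(alpha_hat, item["q"], item["slip"], item["guess"])
    total = 0.0
    for alpha, weight in posterior.items():
        p = dina_prob(alpha, item["q"], item["slip"], item["guess"])
        kl = p_hat * math.log(p_hat / p) + (1 - p_hat) * math.log((1 - p_hat) / (1 - p))
        total += weight * kl
    return total

def select_item(items, alpha_hat, posterior):
    """Return the index of the item with the largest PWKL per unit of expected time."""
    return max(range(len(items)),
               key=lambda j: pwkl(items[j], alpha_hat, posterior) / items[j]["time"])
```

Slip and guessing parameters must lie strictly between 0 and 1 so that the logarithms are defined. Dividing by expected time favors a quick item over an equally informative slow one.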

5.

An exact and optimal standardized person test for assessing consistency with the Rasch model (total citations: 1; self-citations: 0; citations by others: 1)

Karl Christoph Klauer 《Psychometrika》1991,56(2):213-228

The Rasch model predicts that an individual's ability level is invariant over subtests of the total test, and thus, all subtests measure the same latent trait. A person test of this invariance hypothesis is discussed that is uniformly most powerful and standardized in the sense that the conditional distribution of the test statistic, given a particular level of ability, does not depend on the absolute value of the examinee's ability parameter. The test can be routinely performed by applying a computer program designed by and obtainable from the author. Finally, a suboptimal test is derived that is extremely easy to use, and an overall group test of the invariance hypothesis is discussed. None of the tests considered relies on asymptotic approximations; hence, they may be applied when the test is of only moderate length and the group of examinees is small.

6.

Computerized adaptive testing that allows examinees to review and revise answers

陈平 丁树良 《心理学报》2008,40(6):737-747

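A computer simulation program is used to study computerized adaptive testing (CAT) that allows examinees to review and revise their answers, and a new scoring method is adopted to counter the Wainer strategy. The results show that jointly considering the information from an examinee's two rounds of responses yields more accurate ability estimates. Most examinees made revisions, but only a small proportion of answers were changed, and most of the changed answers went from incorrect to correct. Combining the expected a posteriori (EAP) estimate and the maximum likelihood estimate (MLE) from a CAT under the Wainer strategy provides a "rough" way to counter that strategy.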

7.

Controlling item exposure and test overlap on the fly in computerized adaptive testing

Shu‐Ying Chen Pui‐Wa Lei Wen‐Han Liao 《The British journal of mathematical and statistical psychology》2008,61(2):471-492

This paper proposes an on‐line version of the Sympson and Hetter procedure with test overlap control (SHT) that can provide item exposure control at both the item and test levels on the fly without iterative simulations. The on‐line procedure is similar to the SHT procedure in that exposure parameters are used for simultaneous control of item exposure rates and test overlap rate. The exposure parameters for the on‐line procedure, however, are updated sequentially on the fly, rather than through iterative simulations conducted prior to operational computerized adaptive tests (CATs). Unlike the SHT procedure, the on‐line version can control item exposure rate and test overlap rate without time‐consuming iterative simulations even when item pools or examinee populations have been changed. Moreover, the on‐line procedure was found to perform better than the SHT procedure in controlling item exposure and test overlap for examinees who take tests earlier. Compared with two other on‐line alternatives, this proposed on‐line method provided the best all‐around test security control. Thus, it would be an efficient procedure for controlling item exposure and test overlap in CATs.
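
For readers unfamiliar with exposure parameters, the basic Sympson and Hetter administration step can be sketched as below. This shows only the probabilistic filter that such procedures share, assuming items are already ranked by the selection criterion. The on-the-fly update rule and the test overlap control proposed in the paper are not described in the abstract and are not reproduced here. The function and argument names are illustrative.

```python
import random

def administer_with_exposure_control(ranked_items, exposure_params, rng=random):
    """Walk down the ranked candidates; administer each with its exposure probability.

    ranked_items: item identifiers ordered from most to least preferred.
    exposure_params: mapping from item identifier to a probability in [0, 1].
    rng: any object with a random() method returning a float in [0.0, 1.0).
    """
    for item in ranked_items:
        if rng.random() < exposure_params[item]:
            return item
    # Fallback (an assumption of this sketch): give the last candidate if all are rejected
    return ranked_items[-1]
```

An item whose parameter is 1.0 is always administered when selected, while a heavily used item can be throttled by lowering its parameter.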

8.

Reporting subscores for institutions

Shelby Haberman Dr Sandip Sinharay Gautam Puhan 《The British journal of mathematical and statistical psychology》2009,62(1):79-95

Recently, there has been an increasing level of interest in reporting subscores for components of larger assessments. This paper examines the issue of reporting subscores at an aggregate level, especially at the level of institutions to which the examinees belong. A new statistical approach based on classical test theory is proposed to assess when subscores at the institutional level have any added value over the total scores. The methods are applied to two operational data sets. For the data under study, the observed results provide little support in favour of reporting subscores for either examinees or institutions.

9.

A simple and effective decision rule for choosing a significance test to protect against non-normality

Zimmerman DW 《The British journal of mathematical and statistical psychology》2011,64(3):388-409

There is no formal and generally accepted procedure for choosing an appropriate significance test for sample data when the assumption of normality is doubtful. Various tests of normality that have been proposed over the years have been found to have limited usefulness, and sometimes a preliminary test makes the situation worse. The present paper investigates a specific and easily applied rule for choosing between a parametric and non-parametric test, the Student t test and the Wilcoxon-Mann-Whitney test, that does not require a preliminary significance test of normality. Simulations reveal that the rule, which can be applied to sample data automatically by computer software, protects the Type I error rate and increases power for various sample sizes, significance levels, and non-normal distribution shapes. Limitations of the procedure in the case of heterogeneity of variance are discussed.

10.

Investigating Test-Taking Behaviors Using Timing and Process Data

Yi-Hsuan Lee Shelby J. Haberman 《International Journal of Testing》2016,16(3):240-267

The use of computer-based assessments makes the collection of detailed data that capture examinees’ progress in the tests and time spent on individual actions possible. This article presents a study using process and timing data to aid understanding of an international language assessment and the examinees. Issues regarding test-taking strategies, test speededness, test design, and their relationship to examinees’ demographic backgrounds and performance are also discussed.

11.

Robust estimation of examinee ability under the four-parameter logistic weighted model

梅云 简小珠 刘建平 《心理科学》2019,(1):163-169

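Test scenarios with known item parameters and examinee scores were designed, and ability was estimated under the two-, three-, and four-parameter logistic weighted models. The ability steps between examinee score levels were found to be evenly spaced, and examinee scores reflected the score-weighting effect of polytomous scoring well. Under the two-parameter logistic weighted model, examinee parameter estimates were perturbed: guessing led to overestimation of ability, and slipping led to underestimation. Under the c-type three-parameter logistic weighted model, overestimation was absent or not pronounced; under the γ-type three-parameter logistic weighted model, underestimation was absent or not pronounced; under the four-parameter logistic weighted model, neither overestimation nor underestimation appeared or was pronounced. The four-parameter logistic weighted model is therefore a comparatively good method for robust estimation of examinee ability.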
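
The roles of the lower asymptote c and the upper asymptote γ can be illustrated with the standard dichotomous four-parameter logistic response function. This is a minimal sketch: the "weighted" extension for polytomous scores studied in the paper is not reproduced, and the function name and default values are assumptions.

```python
import math

def p_correct(theta, a, b, c=0.0, gamma=1.0):
    """Four-parameter logistic probability of a correct response.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: lower asymptote (guessing); gamma: upper asymptote (1 - gamma is the slip rate).
    With c=0 and gamma=1 this reduces to the two-parameter logistic model.
    """
    return c + (gamma - c) / (1.0 + math.exp(-a * (theta - b)))
```

Because the probability never falls below `c` or rises above `gamma`, a lucky guess by a weak examinee or a careless miss by a strong one carries a bounded likelihood penalty, which is the intuition behind the robustness reported above.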

12.

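Analysis and application of multidimensional item response theory to the college entrance examination (Gaokao) mathematics test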

许志勇 丁树良 钟君 《心理学探新》2013,(5):438-443

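Items on the college entrance examination mathematics test are highly integrative, and a single item usually assesses several ability attributes. Classical test theory and traditional item response theory, which rest on the unidimensionality assumption, cannot analyze the measurement properties of the test or examinees' response performance in this situation. Building on multidimensional item response theory (MIRT) and using the CONQUEST software, this paper obtains the correlations among the ability dimensions within the test and examinees' ability parameters on each dimension, providing a basis for improving item writing and teaching and indicating that MIRT has good application prospects. The complexity of MIRT and the shortcomings of current analysis software limit its further application; these issues deserve deeper study in the future.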

13.

Bayesian IRT Guessing Models for Partial Guessing Behaviors (total citations: 1; self-citations: 0; citations by others: 1)

Jing Cao S. Lynne Stokes 《Psychometrika》2008,73(2):209-230

According to the recent Nation’s Report Card, 12th-graders failed to produce gains on the 2005 National Assessment of Educational Progress (NAEP) despite earning better grades on average. One possible explanation is that 12th-graders were not motivated when taking the NAEP, which is a low-stakes test. We develop three Bayesian IRT mixture models to describe the results from a group of examinees including both nonguessers and partial guessers. The first assumes that the guesser answers questions based on his or her knowledge up to a certain test item, and guesses thereafter. The second model assumes that the guesser answers relatively easy questions based on his or her knowledge and guesses randomly on the remaining items. The third is constructed to describe more general low-motivation behavior. It assumes that the guesser gives less and less effort as he or she proceeds through the test. The models can provide not only consistent estimates of IRT parameters but also estimates of each examinee’s nonguesser/guesser status and degree of guessing behavior. We show results of a simulation study comparing the performance of the three guessing models to the 2PL-IRT model. Finally, an analysis of real data from a low-stakes test administered to university students is presented.

14.

A minimum χ²/EM parameter estimation method in IRT

朱玮 丁树良 陈小攀 《心理学报》2006,38(3):453-460

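For the problem of estimating unknown parameters in the two-parameter logistic model (2PLM) of IRT, a new estimation method, minimum χ²/EM estimation, is presented. Taking full account of the differences between item response theory (IRT) and classical test theory (CTT), the new method improves Berkson's minimum χ² estimation from the standpoint of statistical computation and removes the unrealistic requirement that ability parameters be known when applying Berkson's minimum χ² estimation, thereby broadening its range of application. Experimental results show that the new method estimates ability parameters more accurately than BILOG and that, when the sample size exceeds 2000, its item parameter estimates are also better than BILOG's. The experiments also show that the new method is robust.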

15.

The multiple-attempt, multiple-item (MAMI) test model and its applications

丁树良 罗芬 戴海琦 朱玮 《心理学报》2007,39(4):730-736

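Within the IRT framework, a unidimensional two-parameter logistic multiple-attempt, multiple-item (MAMI) test model is established for 0-1 scoring. Compared with the multiple-attempt, single-item (MASI) model given by Spray, MAMI is a more refined model with a wider range of application, and it uses a different parameter estimation method, obtaining item parameters with the EM algorithm. Monte Carlo simulation results show that, compared with correspondingly increasing the number of test items, the MAMI test model gives the same precision of ability estimation but higher precision of item parameter estimation. Compared with correspondingly increasing the number of examinees, it gives the same precision of item parameter estimation but higher precision of ability estimation. These findings indicate that, under certain conditions, if answers may be revised and cumulative scoring is used, the precision of ability estimation can match that obtained by doubling the number of items even when the number of items is unchanged, and the precision of item parameter estimation also improves. The findings are relevant not only to skill assessment and cognitive ability assessment but also to how such data are processed.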

16.

J. O. Ramsay 《Psychometrika》1995,60(3):323-339

The probability that an examinee chooses a particular option within an item is estimated by averaging over the responses to that item of examinees with similar response patterns for the whole test. The approach does not presume any latent variable structure or any dimensionality. But simulated and actual data analyses are presented to show that when the responses are determined by a latent ability variable, this similarity-based smoothing procedure can reveal the dimensionality of ability very satisfactorily. The author wishes to acknowledge the support of the Natural Sciences and Engineering Research Council of Canada through grant A320, and to thank Educational Testing Service for making the data on the Advanced Placement Chemistry Exam available.

17.

Best linear prediction of composite universe scores

David Jarjoura 《Psychometrika》1983,48(4):525-539

The problem of predicting universe scores for samples of examinees based on their responses to samples of items is treated. A general measurement procedure is described in which multiple test forms are developed from a table of specifications and each form is administered to a different sample of examinees. The measurement model categorizes items according to the cells of such a table, and the linear function derived for minimizing error variance in prediction uses responses to these categories. In addition, some distinctions are drawn between aspects of the approach taken here and the familiar regressed score estimates. The author thanks Robert L. Brennan, Michael J. Kolen, and Richard Sawyer for helpful comments and corrections, and anonymous reviewers for suggested improvements.

18.

Test design patterns for examinees with different cognitive structures

彭亚风 罗照盛 李喻骏 高椿雷 《心理学报》2018,50(1):130-140

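Just as different illnesses call for different medical techniques for diagnosis, different cognitive structures call for correspondingly designed test patterns so that a test delivers high-quality diagnostic assessment. Traditional test formats do not account for the need for diagnostic tests targeted at different cognitive structures, so a one-test-for-all approach falls short in test efficiency. Cognitive diagnostic computerized adaptive testing can administer different items to examinees with different cognitive structures, but the item bank supporting the adaptive process contains no items designed for examinees with different cognitive structures, so the item bank is used inefficiently. The key to solving these problems is to explore how to design test patterns that correspond to different cognitive structures. This study uses Monte Carlo simulation to examine test design patterns for different cognitive structures under six attribute hierarchies. The results show that (1) under the same attribute hierarchy, the best test design pattern differs across cognitive structures; and (2) an item bank built according to the best test design patterns for different cognitive structures is used more efficiently. Test developers can use the results to optimize test design patterns for different cognitive structures and to guide item bank construction.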

19.

Heuristic cognitive diagnosis when the Q‐matrix is unknown

Hans‐Friedrich Köhn Chia‐Yi Chiu Michael J. Brusco 《The British journal of mathematical and statistical psychology》2015,68(2):268-291

Cognitive diagnosis models of educational test performance rely on a binary Q‐matrix that specifies the associations between individual test items and the cognitive attributes (skills) required to answer those items correctly. Current methods for fitting cognitive diagnosis models to educational test data and assigning examinees to proficiency classes are based on parametric estimation methods such as expectation maximization (EM) and Markov chain Monte Carlo (MCMC) that frequently encounter difficulties in practical applications. In response to these difficulties, non‐parametric classification techniques (cluster analysis) have been proposed as heuristic alternatives to parametric procedures. These non‐parametric classification techniques first aggregate each examinee's test item scores into a profile of attribute sum scores, which then serve as the basis for clustering examinees into proficiency classes. Like the parametric procedures, the non‐parametric classification techniques require that the Q‐matrix underlying a given test be known. Unfortunately, in practice, the Q‐matrix for most tests is not known and must be estimated to specify the associations between items and attributes, risking a misspecified Q‐matrix that may then result in the incorrect classification of examinees. This paper demonstrates that clustering examinees into proficiency classes based on their item scores rather than on their attribute sum‐score profiles does not require knowledge of the Q‐matrix, and results in a more accurate classification of examinees.
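
Clustering examinees directly on their item scores can be sketched with a small k-means routine. The abstract does not say which clustering algorithm the authors used, so k-means and the deterministic farthest-point initialization below are assumptions chosen for illustration; no Q-matrix enters the computation.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two score vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, k, n_iter=50):
    """Cluster item-score vectors into k groups and return one label per examinee."""
    # Farthest-point initialization: start from the first examinee, then repeatedly
    # add the examinee farthest from all centers chosen so far.
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Each row of `points` is one examinee's vector of 0/1 item scores. The resulting clusters still have to be interpreted as proficiency classes, which this sketch does not attempt.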

20.

Appropriateness measurement: Review, critique and validating studies

Michael V. Levine Fritz Drasgow 《The British journal of mathematical and statistical psychology》1982,35(1):42-56

The test-taking behaviour of some examinees may be so unusual that their test scores cannot be regarded as appropriate measures of their ability. Appropriateness measurement is a model-based approach to the problem of identifying these test scores. The intuitions and basic theory supporting appropriateness measurement are presented together with a critical review of earlier work and a series of interrelated experiments. We conclude that appropriateness measurement techniques are robust to errors in parameter estimation and robust to the presence of unidentified aberrant examinees in the test norming sample. In addition, the frequently criticized ‘three-parameter logistic’ latent trait model was found to be adequate for the detection of spuriously low scores in actual test data.