1.

Speed-Accuracy Response Models: Scoring Rules based on Response Time and Accuracy

Gunter Maris Han van der Maas 《Psychometrika》2012,77(4):615-633

Starting from an explicit scoring rule for time limit tasks incorporating both response time and accuracy, and a definite trade-off between speed and accuracy, a response model is derived. Since the scoring rule is interpreted as a sufficient statistic, the model belongs to the exponential family. The various marginal and conditional distributions for response accuracy and response time are derived, and it is shown how the model parameters can be estimated. The model for response accuracy is found to be the two-parameter logistic model. It is found that the time limit determines the item discrimination, and this effect is illustrated with the Amsterdam Chess Test II.
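
As a minimal sketch of the two-parameter logistic model named in this abstract, the function below evaluates the probability of a correct response for ability `theta`, discrimination `a`, and difficulty `b`. The function name and the parameter values are illustrative assumptions; the paper's exact link between the time limit and `a` is not reproduced here.

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Same person (theta = 1) and item difficulty (b = 0), three discriminations.
# A larger discrimination separates this person more sharply from the item.
for a in (0.5, 1.0, 2.0):
    print(a, round(p_correct_2pl(1.0, a, 0.0), 3))
```

If the time limit acts on the discrimination, as the abstract reports, then changing the limit would steepen or flatten this curve around `b` without moving its midpoint.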

2.

Sufficiency and Conditional Estimation of Person Parameters in the Polytomous Rasch Model

David Andrich 《Psychometrika》2010,75(2):292-308

Rasch models are characterised by sufficient statistics for all parameters. In the Rasch unidimensional model for two ordered categories, the parameterisation of the person and item is symmetrical and it is readily established that the total scores of a person and item are sufficient statistics for their respective parameters. In contrast, in the unidimensional polytomous Rasch model for more than two ordered categories, the parameterisation is not symmetrical. Specifically, each item has a vector of item parameters, one for each category, and each person only one person parameter. In addition, different items can have different numbers of categories and, therefore, different numbers of parameters. The sufficient statistic for the parameters of an item is itself a vector. In estimating the person parameters in presently available software, these sufficient statistics are not used to condition out the item parameters. This paper derives a conditional, pairwise, pseudo-likelihood and constructs estimates of the parameters of any number of persons which are independent of all item parameters and of the maximum scores of all items. It also shows that these estimates are consistent. Although Rasch’s original work began with equating tests using test scores, and not with items of a test, the polytomous Rasch model has not been applied in this way. Operationally, this is because the current approaches, in which item parameters are estimated first, cannot handle test data where there may be many scores with zero frequencies. A small simulation study shows that, when using the estimation equations derived in this paper, such a property of the data is no impediment to the application of the model at the level of tests. This opens up the possibility of using the polytomous Rasch model directly in equating test scores.
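
The sufficiency property mentioned for the dichotomous case can be checked numerically. The sketch below, which covers only the two-category Rasch model and not the paper's pairwise pseudo-likelihood, computes the probability of a response pattern given its total score and shows that the result does not change with the person parameter. All function names and difficulty values are illustrative assumptions.

```python
import math
from itertools import product

def rasch_p(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def pattern_prob(pattern, theta, difficulties):
    """Probability of a full 0/1 response pattern, assuming local independence."""
    prob = 1.0
    for x, b in zip(pattern, difficulties):
        p = rasch_p(theta, b)
        prob *= p if x == 1 else 1.0 - p
    return prob

def conditional_given_total(pattern, theta, difficulties):
    """Probability of the pattern given its total score."""
    total = sum(pattern)
    same_total = [q for q in product((0, 1), repeat=len(difficulties))
                  if sum(q) == total]
    denom = sum(pattern_prob(q, theta, difficulties) for q in same_total)
    return pattern_prob(pattern, theta, difficulties) / denom

difficulties = (-1.0, 0.0, 1.5)
for theta in (-2.0, 0.0, 2.0):
    # The printed value is the same for every theta.
    print(theta, conditional_given_total((1, 0, 0), theta, difficulties))
```

Because the conditional probability is free of `theta`, conditioning on the total score removes the person parameter, which is the mechanism that conditional estimation relies on.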

3.

A diagnostic tree model for polytomous responses with multiple strategies

Wenchao Ma 《The British journal of mathematical and statistical psychology》2019,72(1):61-82

Constructed-response items have been shown to be appropriate for cognitively diagnostic assessments because students’ problem-solving procedures can be observed, providing direct evidence for making inferences about their proficiency. However, multiple strategies used by students make item scoring and psychometric analyses challenging. This study introduces the so-called two-digit scoring scheme into diagnostic assessments to record both students’ partial credits and their strategies. This study also proposes a diagnostic tree model (DTM) by integrating the cognitive diagnosis models with the tree model to analyse the items scored using the two-digit rubrics. Both convergent and divergent tree structures are considered to accommodate various scoring rules. The MMLE/EM algorithm is used for item parameter estimation of the DTM, and has been shown to provide good parameter recovery under varied conditions in a simulation study. A set of data from TIMSS 2007 mathematics assessment is analysed to illustrate the use of the two-digit scoring scheme and the DTM.

4.

Identification of the 1PL Model with Guessing Parameter: Parametric and Semi-parametric Results

Ernesto San Martín Jean-Marie Rolin Luis M. Castro 《Psychometrika》2013,78(2):341-379

In this paper, we study the identification of a particular case of the 3PL model, namely when the discrimination parameters are all constant and equal to 1. We term this the 1PL-G model. The identification analysis is performed under three different specifications. The first specification considers the abilities as unknown parameters. It is proved that the item parameters and the abilities are identified if a difficulty parameter and a guessing parameter are fixed at zero. The second specification assumes that the abilities are mutually independent and identically distributed according to a distribution known up to the scale parameter. It is shown that the item parameters and the scale parameter are identified if a guessing parameter is fixed at zero. The third specification corresponds to a semi-parametric 1PL-G model, where the distribution G generating the abilities is a parameter of interest. It is not only shown that, after fixing a difficulty parameter and a guessing parameter at zero, the item parameters are identified, but also that under those restrictions the distribution G is not identified. It is finally shown that, after introducing two identification restrictions, either on the distribution G or on the item parameters, the distribution G and the item parameters are identified provided an infinite quantity of items is available.
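
For orientation, the 3PL model with all discriminations fixed at 1 can be written as follows. The symbols are assumptions chosen here ($\theta_i$ for the ability of person $i$, $b_j$ for the difficulty and $c_j$ for the guessing parameter of item $j$); the paper's own notation may differ.

```latex
P(Y_{ij} = 1 \mid \theta_i)
  = c_j + (1 - c_j)\,\frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}
```

Setting $c_j = 0$ recovers the Rasch model. Only differences $\theta_i - b_j$ enter the formula, which is one reason identification restrictions such as fixing a difficulty parameter come into play.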

5.

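The Multiple-Attempt, Multiple-Item Test Model and Its Application

Shuliang Ding Fen Luo Haiqi Dai Wei Zhu 《Acta Psychologica Sinica》2007,39(4):730-736

Within the IRT framework, a unidimensional two-parameter logistic multiple-attempt, multiple-item (MAMI) test model is established for 0-1 scoring. Compared with the multiple-attempt, single-item (MASI) model given by Spray, MAMI is not only a more refined model but also has a wider scope of application, and its parameter estimation method differs as well: the EM algorithm is used to obtain the item parameters. Monte Carlo simulation results show that, compared with correspondingly increasing the number of test items, the MAMI test model gives the same precision of ability estimates but higher precision of item parameter estimates. Compared with correspondingly increasing the number of examinees, the precision of the item parameter estimates is the same, but MAMI gives more precise ability estimates. These findings indicate that, under certain conditions, if examinees are allowed to revise their answers and cumulative scoring is used, then even with the number of items unchanged the precision of ability estimation can match that obtained by doubling the number of items, and the precision of item parameter estimation also improves. The findings are of reference value not only for skill assessment and cognitive ability assessment but also for the way data are processed.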

6.

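Theoretical Construction and Simulation Analysis of the Logistic Weighted Model

Xiaozhu Jian Buyun Dai Haiqi Dai 《Acta Psychologica Sinica》2016,48(12):1625-1630

Item difficulty and weighting by the importance of what an item assesses are two basic properties of polytomously scored items, and they therefore need to be represented by different parameters in the IRT item characteristic function. Previous polytomous models use several difficulty parameters to describe the difficulty of a polytomously scored item and cannot effectively express the score-weighting role of such items. From the perspective of the score-weighting role of polytomously scored items, this paper proposes the Logistic weighted model and discusses the ideas behind its theoretical construction. Under the Logistic weighted model, the EM algorithm for item parameter estimation is derived and a corresponding parameter estimation program is written. Test simulations under the Logistic weighted model show good recovery of the item parameters.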

7.

SEM of another flavour: Two new applications of the supplemented EM algorithm

Li Cai 《The British journal of mathematical and statistical psychology》2008,61(2):309-329

The supplemented EM (SEM) algorithm is applied to address two goodness‐of‐fit testing problems in psychometrics. The first problem involves computing the information matrix for item parameters in item response theory models. This matrix is important for limited‐information goodness‐of‐fit testing and it is also used to compute standard errors for the item parameter estimates. For the second problem, it is shown that the SEM algorithm provides a convenient computational procedure that leads to an asymptotically chi‐squared goodness‐of‐fit statistic for the ‘two‐stage EM’ procedure of fitting covariance structure models in the presence of missing data. Both simulated and real data are used to illustrate the proposed procedures.

8.

A Variable-Weight Item Selection Strategy for Computerized Adaptive Testing under the Generalized Partial Credit Model

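Xiaoyang Cheng Shuliang Ding 《Journal of Psychological Science》2011,34(4):965-969

In computerized adaptive testing, a-stratified item selection is an efficient and secure strategy for 0-1 scoring models, but it cannot be applied directly to polytomous scoring models because each item has several difficulty/step parameters. The information function integrates all item parameters and the ability parameter well, but the maximum-information item selection strategy can compromise test security. This paper proposes a variable-weight item selection strategy based on a function tied to the information: the function is proportional to the information and inversely proportional to a power function of the discrimination, so that it both integrates all item parameters and achieves the effect of a-stratification. A Monte Carlo comparison under the GPCM shows that the new item selection strategy performs better overall than existing related results.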
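
The selection idea in this abstract can be sketched as follows, assuming the criterion takes the simple form information divided by `a ** power`. That exact form, the exponent `power`, the function names, and the two-item pool are hypothetical and are not taken from the paper. The information formula is the standard GPCM result: the squared discrimination times the variance of the item score.

```python
import math

def gpcm_probs(theta, a, steps):
    """Category probabilities (scores 0..m) under the generalized partial credit model."""
    # Cumulative sums of a * (theta - step); score 0 has exponent 0.
    exponents = [0.0]
    for b in steps:
        exponents.append(exponents[-1] + a * (theta - b))
    top = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def gpcm_info(theta, a, steps):
    """Fisher information: a^2 times the variance of the item score at theta."""
    probs = gpcm_probs(theta, a, steps)
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return a * a * var

def select_item(theta, pool, power):
    """Index of the item maximizing information / a**power (hypothetical criterion)."""
    def criterion(i):
        a, steps = pool[i]
        return gpcm_info(theta, a, steps) / a ** power
    return max(range(len(pool)), key=criterion)

# Two items with identical step parameters but different discriminations.
pool = [(0.6, [-0.5, 0.5]), (1.8, [-0.5, 0.5])]
print(select_item(0.0, pool, power=0))  # plain maximum information
print(select_item(0.0, pool, power=2))  # discrimination penalized
```

With `power=0` the rule reduces to maximum information and favors the highly discriminating item; a larger `power` shifts selection toward low-discrimination items, which is the behavior a-stratification aims for early in a test.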

9.

An Isotonic Partial Credit Model for Ordering Subjects on the Basis of Their Sum Scores

Rudy Ligtvoet 《Psychometrika》2012,77(3):479-494

In practice, the sum of the item scores is often used as a basis for comparing subjects. For items that have more than two ordered score categories, only the partial credit model (PCM) and special cases of this model imply that the subjects are stochastically ordered on the common latent variable. However, the PCM is very restrictive with respect to the constraints that it imposes on the data. In this paper, sufficient conditions for the stochastic ordering of subjects by their sum score are obtained. These conditions define the isotonic (nonparametric) PCM model. The isotonic PCM is more flexible than the PCM, which makes it useful for a wider variety of tests. Also, observable properties of the isotonic PCM are derived in the form of inequality constraints. It is shown how to obtain estimates of the score distribution under these constraints by using the Gibbs sampling algorithm. A small simulation study shows that the Bayesian p-values based on the log-likelihood ratio statistic can be used to assess the fit of the isotonic PCM to the data, where model-data fit can be taken as a justification of the use of the sum score to order subjects.

10.

Admissible probability measurement procedures

Emir H. Shuford Jr. Arthur Albert H. Edward Massengill 《Psychometrika》1966,31(2):125-145

Admissible probability measurement procedures utilize scoring systems with a very special property that guarantees that any student, at whatever level of knowledge or skill, can maximize his expected score if and only if he honestly reflects his degree-of-belief probabilities. Section 1 introduces the notion of a scoring system with the reproducing property and derives the necessary and sufficient condition for the case of a test item with just two possible answers. A method is given for generating a virtually inexhaustible number of scoring systems, both symmetric and asymmetric, with the reproducing property. A negative result concerning the existence of a certain subclass of reproducing scoring systems for the case of more than two possible answers is obtained. Whereas Section 1 is concerned with those instances in which the possible answers to a query are stated in the test itself, Section 2 is concerned with those instances in which the student himself must provide the possible answer(s). In this case, it is shown that a certain minor modification of a scoring system with the reproducing property yields the desired admissible probability measurement procedure. The research reported in this paper was, in part, performed at the Decision Sciences Laboratory in support of Project 4690, Information Processing in Command and Control and, in part, sponsored by the Air Force Systems Command Electronic Systems Division, Decision Sciences Laboratory, under Contract No. AF 19(628)-4304, with ARCON Inc. This report is identified as ESD-TR-65-567. Further reproduction is authorized to satisfy the needs of the U. S. Government.
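
The reproducing property can be illustrated with the quadratic scoring rule for an item with two possible answers. The quadratic rule is a well-known example chosen here for illustration; whether it matches any specific system in the paper is an assumption, and the function names are hypothetical. The sketch searches a grid of possible reports and finds that the expected score peaks when the report equals the student's actual belief.

```python
def quadratic_score(report: float, answer_is_true: bool) -> float:
    """Score for reporting probability `report` that the answer is true."""
    miss = (1.0 - report) if answer_is_true else report
    return 1.0 - miss ** 2

def expected_score(report: float, belief: float) -> float:
    """Expected score when the student's actual degree of belief is `belief`."""
    return (belief * quadratic_score(report, True)
            + (1.0 - belief) * quadratic_score(report, False))

def best_report(belief: float, grid_size: int = 101) -> float:
    """Report on an evenly spaced grid over [0, 1] that maximizes the expected score."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda r: expected_score(r, belief))

for belief in (0.2, 0.5, 0.7):
    print(belief, best_report(belief))
```

Setting the derivative of the expected score with respect to the report to zero gives `report = belief`, so exaggerating confidence (for example reporting 1.0 when the belief is 0.7) lowers the expected score.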

11.

Constant latent odds-ratios models and the Mantel-Haenszel null hypothesis

David J. Hessen 《Psychometrika》2005,70(3):497-516

In the present paper, a new family of item response theory (IRT) models for dichotomous item scores is proposed. Two basic assumptions define the most general model of this family. The first assumption is local independence of the item scores given a unidimensional latent trait. The second assumption is that the odds-ratios for all item-pairs are constant functions of the latent trait. Since the latter assumption is characteristic of the whole family, the models are called constant latent odds-ratios (CLORs) models. One nonparametric special case and three parametric special cases of the general CLORs model are shown to be generalizations of the one-parameter logistic Rasch model. For all CLORs models, the total score (the unweighted sum of the item scores) is shown to be a sufficient statistic for the latent trait. In addition, conditions under the general CLORs model are studied for the investigation of differential item functioning (DIF) by means of the Mantel-Haenszel procedure. This research was supported by the Dutch Organization for Scientific Research (NWO), grant number 400-20-026.
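
The constant odds-ratio assumption is easy to verify for the Rasch model, the special case that the abstract says these models generalize. The sketch below computes the ratio of the success odds of two items at several latent trait values; the function names and difficulty values are illustrative assumptions.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def odds(p):
    """Odds of success for a success probability p."""
    return p / (1.0 - p)

def latent_odds_ratio(theta, b_i, b_j):
    """Ratio of the success odds of item i to those of item j at a given theta."""
    return odds(rasch_p(theta, b_i)) / odds(rasch_p(theta, b_j))

b_i, b_j = -0.5, 1.0
for theta in (-3.0, 0.0, 3.0):
    # The ratio equals exp(b_j - b_i) at every theta.
    print(theta, latent_odds_ratio(theta, b_i, b_j))
```

Under the Rasch model the odds of success on an item are `exp(theta - b)`, so `theta` cancels in the ratio.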

12.

Scoring Depression on a Common Metric: A Comparison of EAP Estimation, Plausible Value Imputation, and Full Bayesian IRT Modeling

H. Felix Fischer Matthias Rose 《Multivariate behavioral research》2019,54(1):85-99

There are a growing number of item response theory (IRT) studies that calibrate different patient-reported outcome (PRO) measures, such as anxiety, depression, physical function, and pain, on common, instrument-independent metrics. In the case of depression, it has been reported that there are considerable mean score differences when scoring on a common metric from different, previously linked instruments. Ideally, those estimates should be the same. We investigated to what extent those differences are influenced by different scoring methods that take into account several levels of uncertainty, such as measurement error (through plausible value imputation) and item parameter uncertainty (through full Bayesian IRT modeling). Depression estimates from different instruments were more similar, and their corresponding confidence/credible intervals were larger when plausible value imputation or Bayesian modeling was used, compared to the direct use of expected a posteriori (EAP) estimates. Furthermore, we explored the use of Bayesian IRT models to update item parameters based on newly collected data.
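
An expected a posteriori (EAP) estimate is the mean of the posterior distribution of the latent trait. The sketch below approximates it on an evenly spaced grid, assuming dichotomous 2PL items with known parameters and a standard normal prior. These are simplifying assumptions for illustration: the instruments in the study use other response formats, and the function names and item values are hypothetical.

```python
import math

def p_2pl(theta, a, b):
    """Probability of endorsing an item under the two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_estimate(responses, items, n_points=81, lower=-4.0, upper=4.0):
    """Posterior mean and SD of theta on a grid, with a standard normal prior."""
    step = (upper - lower) / (n_points - 1)
    grid = [lower + k * step for k in range(n_points)]
    posterior = []
    for theta in grid:
        weight = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for x, (a, b) in zip(responses, items):
            p = p_2pl(theta, a, b)
            weight *= p if x == 1 else 1.0 - p
        posterior.append(weight)
    norm = sum(posterior)
    mean = sum(t * w for t, w in zip(grid, posterior)) / norm
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, posterior)) / norm
    return mean, math.sqrt(var)

items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]  # (discrimination, difficulty)
print(eap_estimate([1, 1, 0], items))
```

The posterior standard deviation returned alongside the mean is the measurement uncertainty that a point estimate alone leaves out; plausible values are random draws from this same posterior rather than its mean.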

13.

Exploring a Source of Uneven Score Equity across the Test Score Range

Anne Corinne Huggins-Manley Yuxi Qiu Randall D. Penfield 《International Journal of Testing》2018,18(1):50-70

Score equity assessment (SEA) refers to an examination of population invariance of equating across two or more subpopulations of test examinees. Previous SEA studies have shown that score equity may be present for examinees scoring at particular test score ranges but absent for examinees scoring at other score ranges. No studies to date have performed research for the purpose of understanding why score equity can be inconsistent across the score range of some tests. The purpose of this study is to explore a source of uneven subpopulation score equity across the score range of a test. It is hypothesized that the difficulty of anchor items displaying differential item functioning (DIF) is directly related to the score location at which issues of score inequity are observed. The simulation study supports the hypothesis that the difficulty of DIF items has a systematic impact on the uneven nature of conditional score equity.

14.

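The Minimum χ2/EM Parameter Estimation Method in IRT

Wei Zhu Shuliang Ding Xiaopan Chen 《Acta Psychologica Sinica》2006,38(3):453-460

For the problem of estimating the unknown parameters of the two-parameter logistic model (2PLM) in IRT, a new estimation method, minimum χ2/EM estimation, is presented. Taking full account of the differences between item response theory (IRT) and classical test theory (CTT), the new method improves Berkson's minimum χ2 estimation from the perspective of statistical computation: it removes the unrealistic premise that the ability parameters must be known when Berkson's minimum χ2 estimation is carried out, and thus widens the scope of application. Experimental results show that the new method's ability parameter estimates are more precise than those of BILOG and that, when the sample size exceeds 2000, its item parameter estimates are also better than BILOG's. The experiments also show that the new method is robust.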

15.

THE EFFECT OF FACTOR SCORES, GUTTMAN SCORES, AND SIMPLE SUM SCORES ON THE SIZE OF F RATIOS IN AN ANALYSIS OF VARIANCE DESIGN

《Multivariate behavioral research》2013,48(4):491-502

Many questions in the social sciences reduce to a comparison of mean values across groups in a classical analysis of variance F test. Often the original data may come from a set of items in a questionnaire or personality inventory. When this occurs, some sort of data reduction, combining of items, or scaling procedure is first performed before the hypothesis of no difference in mean values across groups can be made. In many cases, this problem causes undue concern to a researcher because the effect of the scoring procedure on the distribution of F is not clear. To help solve this problem, this study was undertaken to investigate whether the method used to calculate scores has any effect on the magnitude of the F ratio in an analysis of variance, for, if it were shown that no statistical difference existed, then a researcher would have some justification for showing the procedure having minimal messes. On the other hand, if statistical differences were to arise because of the kind of scaling procedure employed, then a researcher would have to be more cautious in his choice. For this empirical investigation, Guttman, factor, and simple sum scores were generated using item responses from a large pool of high school seniors. No difference in scoring method was detected when the F ratios resulting from each of the three scoring methods were analyzed. This suggests that, for certain analyses, a simple sum score may be as effective as scores derived by more complicated methods.

16.

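Robustness of Item-Level Model Comparison Statistics in Cognitive Diagnosis Models

Yanlou Liu Qianmeng Zhang Zongjun Zheng Hao Yin 《Journal of Psychological Science》2005,(5):1251-1259

A simulation study compared the performance, in item-level model comparison under model-data misfit, of three statistics proposed in previous research, namely the Wald statistics based on the observed information matrix and on the sandwich matrix (denoted W_Obs and W_Sw, respectively) and the likelihood ratio statistic, with a newly proposed Wald statistic based on the empirical cross-product information matrix (W_XPD). The results show that: (1) the Type I error control of W_Sw is highly robust; (2) W_XPD outperforms W_Sw under most conditions in which the Q-matrix is misspecified. Conclusion: when the model fits the data well, W_Sw can be used for item-level model comparison; when the model and the data misfit, W_XPD may be the better choice.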

17.

The many null distributions of person fit indices

Ivo W. Molenaar Herbert Hoijtink 《Psychometrika》1990,55(1):75-106

This paper deals with the situation of an investigator who has collected the scores of n persons to a set of k dichotomous items, and wants to investigate whether the answers of all respondents are compatible with the one parameter logistic test model of Rasch. Contrary to the standard analysis of the Rasch model, where all persons are kept in the analysis and badly fitting items may be removed, this paper studies the alternative model in which a small minority of persons has an answer strategy not described by the Rasch model. Such persons are called anomalous or aberrant. From the response vectors consisting of k symbols each equal to 0 or 1, it is desired to classify each respondent as either anomalous or as conforming to the model. As this model is probabilistic, such a classification will possibly involve false positives and false negatives. Both for the Rasch model and for other item response models, the literature contains several proposals for a person fit index, which expresses for each individual the plausibility that his/her behavior follows the model. The present paper argues that such indices can only provide a satisfactory solution to the classification problem if their statistical distribution is known under the null hypothesis that all persons answer according to the model. This distribution, however, turns out to be rather different for different values of the person's latent trait value. This value will be called ability parameter, although our results are equally valid for Rasch scales measuring other attributes. As the true ability parameter is unknown, one can only use its estimate in order to obtain an estimated person fit value and an estimated null hypothesis distribution. The paper describes three specifications for the latter: assuming that the true ability equals its estimate, integrating across the ability distribution assumed for the population, and conditioning on the total score, which is in the Rasch model the sufficient statistic for the ability parameter. Classification rules for aberrance will be worked out for each of the three specifications. Depending on test length, item parameters and desired accuracy, they are based on the exact distribution, its Monte Carlo estimate and a new and promising approximation based on the moments of the person fit statistic. Results for the likelihood person fit statistic are given in detail; the methods could also be applied to other fit statistics. A comparison of the three specifications results in the recommendation to condition on the total score, as this avoids some problems of interpretation that affect the other two specifications. The authors express their gratitude to the reviewers and to many colleagues for comments on an earlier version.
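
A likelihood-based person fit index can be sketched as the log-likelihood of a response vector, standardized by its mean and variance under the Rasch model at a given ability. The standardized form below follows the commonly used `lz` construction and treats the ability as known, which corresponds to the first of the three specifications listed above. Whether it matches the paper's exact index is an assumption, and the function names and item values are hypothetical.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_likelihood_fit(responses, theta, difficulties):
    """Log-likelihood of a 0/1 response vector at ability theta."""
    total = 0.0
    for x, b in zip(responses, difficulties):
        p = rasch_p(theta, b)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def standardized_fit(responses, theta, difficulties):
    """(log-likelihood - expected value) / standard deviation, all at theta."""
    mean = 0.0
    var = 0.0
    for b in difficulties:
        p = rasch_p(theta, b)
        mean += p * math.log(p) + (1.0 - p) * math.log(1.0 - p)
        var += p * (1.0 - p) * math.log(p / (1.0 - p)) ** 2
    observed = log_likelihood_fit(responses, theta, difficulties)
    return (observed - mean) / math.sqrt(var)

difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(standardized_fit([1, 1, 1, 0, 0], 0.0, difficulties))  # easy items right
print(standardized_fit([0, 0, 1, 1, 1], 0.0, difficulties))  # easy items wrong
```

Large negative values flag patterns that are unlikely under the model, such as missing the easy items while solving the hard ones. The abstract's point is that the null distribution of such an index changes with ability, so a fixed cutoff taken from a normal approximation should be treated with caution.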

18.

On Compensation in Multidimensional Response Modeling

Wim J. van der Linden 《Psychometrika》2012,77(1):21-30

The issue of compensation in multidimensional response modeling is addressed. We show that multidimensional response models are compensatory in their ability parameters if and only if they are monotone. In addition, a minimal set of assumptions is presented under which the MLEs of the ability parameters are also compensatory. In a recent series of articles, beginning with Hooker, Finkelman, and Schwartzman (2009) in this journal, the second type of compensation was presented as a paradoxical result for certain multidimensional response models, leading to occasional unfairness in maximum-likelihood test scoring. First, it is indicated that the compensation is not unique and holds generally for any multiparameter likelihood with monotone score functions. Second, we analyze why, in spite of its generality, the compensation may give the impression of a paradox or unfairness.

19.

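Robustness of Item-Level Model Comparison Statistics in Cognitive Diagnosis Models

Yanlou Liu Qianmeng Zhang Zongjun Zheng Hao Yin 《Journal of Psychological Science》2019,(5):1251-1259

A simulation study compared the performance, in item-level model comparison under model-data misfit, of three statistics proposed in previous research, namely the Wald statistics based on the observed information matrix and on the sandwich matrix (denoted W_Obs and W_Sw, respectively) and the likelihood ratio statistic, with a newly proposed Wald statistic based on the empirical cross-product information matrix (W_XPD). The results show that: (1) the Type I error control of W_Sw is highly robust; (2) W_XPD outperforms W_Sw under most conditions in which the Q-matrix is misspecified. Conclusion: when the model fits the data well, W_Sw can be used for item-level model comparison; when the model and the data misfit, W_XPD may be the better choice.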

20.

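A Nonparametric Q-Matrix Estimation Method: Development of the ICC-IR Method

Daxun Wang Xuliang Gao Yan Cai Dongbo Tu 《Journal of Psychological Science》2018,(2):466-474

In contrast to parametric methods, this study develops an ICC index from the relationships among the measurement patterns of items and proposes an ICC index method based on ideal scores for estimating the Q-matrix. A Monte Carlo simulation study and an empirical study found that: (1) the ICC index method based on ideal scores estimates the Q-matrix very well, and the fewer the attributes and the more the base items, the better the estimation; (2) compared with an earlier method, the D2 statistic method, the ICC-IR method performs better and, being nonparametric, is simple and fast to compute; (3) the empirical data analysis shows that the Q-matrix estimated by the ICC-IR method also fits better than that of the D2 statistic method in terms of model fit.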