
1.

Generating items during testing: Psychometric issues and models (Total citations: 2; self-citations: 0; citations by others: 2)

Susan E. Embretson 《Psychometrika》1999,64(4):407-433

On-line item generation is becoming increasingly feasible for many cognitive tests. Item generation seemingly conflicts with the well-established principle of measuring persons from items with known psychometric properties. This paper examines psychometric principles and models required for measurement from on-line item generation. Three psychometric issues are elaborated for item generation. First, design principles to generate items are considered. A cognitive design system approach is elaborated and then illustrated with an application to a test of abstract reasoning. Second, psychometric models for calibrating generating principles, rather than specific items, are required. Existing item response theory (IRT) models are reviewed and a new IRT model that includes the impact on item discrimination, as well as difficulty, is developed. Third, the impact of item parameter uncertainty on person estimates is considered. Results from both fixed content and adaptive testing are presented. (Editor's note: This article is based on the Presidential Address Susan E. Embretson gave on June 26, 1999 at the 1999 Annual Meeting of the Psychometric Society held at the University of Kansas in Lawrence, Kansas.)
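As an illustration of calibrating generating principles rather than individual items, the sketch below predicts an item's difficulty and discrimination as linear functions of its design features and plugs them into a two-parameter logistic response function. This is only a minimal sketch in the spirit of feature-based IRT models; the function names, the feature coding, and all weights are hypothetical and are not taken from the article.

```python
import math

def predict_item_parameters(features, difficulty_weights, discrimination_weights,
                            base_difficulty=0.0, base_discrimination=1.0):
    """Predict (discrimination, difficulty) from design features.

    All weights and intercepts are hypothetical illustration values.
    """
    difficulty = base_difficulty + sum(
        w * q for w, q in zip(difficulty_weights, features))
    discrimination = base_discrimination + sum(
        w * q for w, q in zip(discrimination_weights, features))
    return discrimination, difficulty

def prob_correct(theta, discrimination, difficulty):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# A generated item described by three made-up feature counts.
a, b = predict_item_parameters(
    features=[1, 0, 2],
    difficulty_weights=[0.5, 0.3, 0.2],
    discrimination_weights=[0.1, 0.0, -0.05],
)
p = prob_correct(theta=b, discrimination=a, difficulty=b)
```

Because the parameters are attached to features, any newly generated item with a known feature vector receives predicted parameters without being calibrated itself.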

2.

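The impact of predicted parameters of cognitively designed items under computerized adaptive testing

Yang Xiangdong 《Acta Psychologica Sinica》2010,42(7):802-812

In automatic item generation (AIG), item parameters are predicted from the set of stimulus features specified by the cognitive item design, so their sources of uncertainty are more complex than those of parameters calibrated from empirical data. This empirical study examined how accurately ability parameters are estimated under computerized adaptive testing when predicted parameters are used for Abstract Reasoning Test (ART) items generated with the cognitive design system approach. The results show that the predicted item parameters are distributed more centrally than the corresponding calibrated parameters. This regression effect influences the size of the ability estimation error and also leads to differences in item selection during adaptive testing. After controlling for differences in item selection, the ability estimation error was larger than that obtained with calibrated item parameters, but the difference was not pronounced. The two sets of ability estimates were highly correlated, and the differences between corresponding ability values were small across almost the entire range of the ability distribution.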

3.

Generation and hypermnesia

Mulligan NW 《Journal of experimental psychology. Learning, memory, and cognition》2001,27(2):436-450

The multifactor account of the generation effect makes detailed predictions about the effects of generation on item-specific and relational encoding, predictions confirmed in four experiments using a multiple-test methodology. In pure-list designs with unrelated study items, generation produced more intertest item gains (indexing greater item-specific processing) and more intertest item losses (indexing less relational processing) relative to the read condition. In a mixed-list design, generation produced more gains but did not affect losses. With categorically related study items, generation produced more gains but fewer losses (indicating enhanced relational encoding). Generation consistently produced hypermnesia whereas reading did so only for related study items. Also, a significant generation effect emerged on later tests under conditions (between-subjects design, unrelated study items) which typically yield no generation effect.

4.

Testing the actual equivalence of automatically generated items

Debora de Chiusole Luca Stefanutti Pasquale Anselmi Egidio Robusto 《Behavior research methods》2018,50(1):39-56

If automatic item generation is used for generating test items, the question of how the equivalence among different instances may be tested is fundamental to assuring an accurate assessment. In the present research, this question was addressed using the knowledge space theory framework. Two different ways of considering the equivalence among instances are proposed: The former is at a deterministic level and requires that all the instances of an item template belong to exactly the same knowledge states; the latter adds a probabilistic level to the deterministic one. The former type of equivalence can be modeled by using the BLIM with a knowledge structure assuming equally informative instances; the latter can be modeled by a constrained BLIM. This model assumes equality constraints among the error parameters of the equivalent instances. An approach is proposed for testing the equivalence among instances, which is based on a series of model comparisons. A simulation study and an empirical application show the viability of the approach.

5.

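Structural analysis methods for item generation of algebra word problems

Yang Xiangdong 《Advances in Psychological Science》2014,22(3):558-570

Automatic item generation is a measurement field that has emerged in recent years; it is a mode of principled item design grounded in theories of the cognitive processing of items. A key step is to identify and extract task features systematically and comprehensively through task structure analysis on the basis of a cognitive model of the item. Drawing on four structural analysis methods for algebra word problems in the existing literature (propositional analysis, network language analysis, relational-functional analysis, and task analysis maps), this study explored task structure analysis methods for algebra word problems that can serve automatic item generation. The analysis shows that the first three methods correspond to the three intermediate representations that a solver needs to form during problem solving: the propositional representation underlying the problem statement, the situation model of the spatiotemporal relations among events, and the problem model of the quantitative relations among variables. The fourth method analyzes the cognitive demands of problem solving from a process perspective. However, to meet the feature extraction needs of item generation, further research is still needed on the psychological reality of the problem features revealed by the four methods, the systematicity and completeness of feature extraction, the range of task domains to which the methods apply, and the integration of the different methods.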

6.

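A comparison of item selection strategies in computerized adaptive testing based on the GPCM (Total citations: 1; self-citations: 0; citations by others: 1)

Liu Zhen, Ding Shuliang, Lin Haijing 《Acta Psychologica Sinica》2008,40(5):618-625

Item selection strategy is an important topic in research on computerized adaptive testing (CAT); its quality bears directly on the reliability, validity, and security of a test. Much CAT research and application is built on 0-1 dichotomous scoring models, and studies of item selection strategies for polytomously scored CAT have rarely been reported. Although CAT research based on the GRM has been carried out in China, no research on CAT based on the GPCM has yet been reported. Using a computer simulation program, this paper compared four item selection strategies for CAT under the generalized partial credit model (GPCM) across a variety of conditions. The results show that when examinee ability is normally distributed, the effectiveness of an item selection strategy depends heavily on the distribution of the item step parameters. (1) When the item step parameters all follow a normal distribution, the strategy that matches ability to the item step parameters performs best; (2) when the item step parameters all follow a uniform distribution, the strategy that matches ability to the mean of the item step parameters performs best.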
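For readers unfamiliar with the GPCM, the sketch below computes its category response probabilities from an ability value, a discrimination parameter, and a list of step parameters. It is a minimal illustration of the standard model formula, not the simulation program used in the study; the function name and the example values are assumptions.

```python
import math

def gpcm_probs(theta, a, steps):
    """Category probabilities P(X = 0), ..., P(X = m) under the GPCM.

    theta: ability; a: discrimination; steps: step parameters b_1..b_m.
    """
    # Cumulative sums of a * (theta - b_v); category 0 has an empty sum (0.0).
    exponents = [0.0]
    for b in steps:
        exponents.append(exponents[-1] + a * (theta - b))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(exponents)
    weights = [math.exp(z - peak) for z in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical three-category item with step parameters -1 and 1.
probs = gpcm_probs(theta=0.0, a=1.0, steps=[-1.0, 1.0])
```

With a single step parameter the function reduces to the two-parameter logistic model, which is a convenient sanity check.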

7.

Latent class models for testing monotonicity and invariant item ordering for polytomous items

Ligtvoet R Vermunt JK 《The British journal of mathematical and statistical psychology》2012,65(2):237-250

Two assumptions that are relevant to many applications using item response theory are the assumptions of monotonicity (M) and invariant item ordering (IIO). A latent class model is proposed for ordinal items with inequality constraints on the class-specific item means. This model is used as a tool for testing for violations of M and IIO. A Gibbs sampling scheme is used for estimating the model parameters. It is shown that the deviance information criterion can be used as an overall test of M and IIO, while posterior predictive checks can be used to test these assumptions at the item level. A real data application illustrates a model-fitting strategy for detecting items that violate M and IIO.

8.

Optimal Online Calibration Designs for Item Replenishment in Adaptive Testing

He Yinhong Chen Ping 《Psychometrika》2020,85(1):35-55

Maintaining the item bank is essential for continuously implementing adaptive tests. Calibration of new items online provides an opportunity to efficiently replenish items for the operational item bank. In this study, a new optimal design for online calibration (referred to as D-c) is proposed by incorporating the idea of the original D-optimal design into the reformed D-optimal design proposed by van der Linden and Ren (Psychometrika 80:263–288, 2015) (denoted as D-VR design). To deal with the dependence of design criteria on the unknown item parameters of new items, Bayesian versions of the locally optimal designs (e.g., D-c and D-VR) are put forward by adding prior information to the new items. In the simulation implementation of the locally optimal designs, five calibration sample sizes were used to obtain different levels of estimation precision for the initial item parameters, and two approaches were used to obtain the prior distributions in Bayesian optimal designs. Results showed that the D-c design performed well and retired a smaller number of new items than the D-VR design at almost all levels of examinee sample size; the Bayesian version of D-c using the prior obtained from the operational items worked better than that using the default priors in BILOG-MG and PARSCALE; and Bayesian optimal designs generally outperformed locally optimal designs when the initial item parameters of the new items were poorly estimated.


9.

Abortion attitudes, 1984-1987-1988: effects of item order and dimensionality.

E Tenvergert M W Gillespie J Kingma H Klasen 《Perceptual and motor skills》1992,74(2):627-642

The comparability of surveys is often hampered by differences in the item order of presentation. The major focus of the present study was to investigate whether a general item or a specific item at the beginning of the questionnaire would affect the endorsement as well as the scalability of a set of attitude items. By using a quasi-A-B-A experimental design for the six abortion items that appeared in the Edmonton Area Survey for the years 1984, 1987, and 1988, we found that the order of presentation of the items dramatically affected the endorsement of the abortion items. Approval of a general item was considerably higher when asked first than when asked after a specific item. In contrast, it was shown by means of a nonparametric item response theory model (the Mokken scale analysis) that the unidimensionality of the six abortion items was not affected by the manipulations of item order (i.e., the six abortion items measured the same concept in the three surveys). It was concluded that the six items are unidimensional and, therefore, create a single scale to measure the change in abortion attitudes across the three periods.

10.

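Nonparametric cognitive diagnostic computerized adaptive testing that uses item option information

Sun Xiaojian, Guo Lei 《Acta Psychologica Sinica》2022,54(9):1137-1150

The response options of multiple-choice items can provide additional diagnostic information. To make full use of option information, this study proposes two nonparametric item selection strategies that handle the option information of multiple-choice items, together with variable-length termination rules, for cognitive diagnostic computerized adaptive testing (CD-CAT). The simulation results show that: (1) under fixed-length conditions, the classification accuracy of the two nonparametric item selection strategies was generally higher than that of the parametric item selection strategies; (2) the two nonparametric strategies produced more balanced item bank usage than the parametric strategies; (3) the nonparametric strategies achieved higher classification accuracy under the two new variable-length termination rules; (4) both nonparametric strategies are suitable for multiple-choice CD-CAT, and users may choose either one for test analysis.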

11.

Differential item functioning in an international 360-degree assessment: Evidence of gender stereotype,environmental complexity,and organizational contingency

Jim Penny 《European Journal of Work and Organizational Psychology》2013,22(3):245-271

This research used logistic regression to model item responses from a popular 360-degree-for-development survey used in a leadership development programme given to middle and upper level European managers in Brussels. The survey contained 106 items on 16 scales. The model used gender of ratee and rater group to identify items that exhibited differential item functioning (DIF). The rater groups were self, boss, peer, and direct report. The sample consisted of 356 survey families where a survey family consisted of a matched set of four surveys: one self, one boss, one peer, and one direct report. The sample contained 88% male and 12% female raters. The sample contained 1424 total surveys. The procedure for flagging items exhibiting differential functioning used effect size computed from Wald chi-square statistics rather than statistical significance, resulting in fewer flagged items. One item exhibited rating anomalies due to the gender of the ratee; 55 items exhibited DIF attributable to rater group. The apparent effect of the DIF was small with each item. An examination of the maximum likelihood parameter estimates suggested the rater group DIF was the result of either hierarchical complexity or organizational contingency. The DIF due to gender conformed to prior expectations of gender-related stereotypical interpretations. This research further suggested that DIF due to environmental complexity or organizational contingency could be a naturally occurring phenomenon in some 360-degree assessment, and that the interpretation of some 360-degree feedback could need to include the potential for such DIF to exist.

12.

Comparative judgments as an alternative to ratings: identifying the scale origin

Böckenholt U 《Psychological Methods》2004,9(4):453-465

Although comparative judgment methods have a number of distinct advantages over ratings, they share one common problem: On the basis of comparative judgments, it is not possible to recover the origin of item evaluations. One item may be judged more positively than another, but this result does not allow any conclusions about whether either of the items is attractive or unattractive. This article discusses the implications of this limitation for the interpretation of individual differences in comparative judgments. It also presents 3 different methods that may allow determination of the scale origin using a nested model comparison approach. An application illustrates the proposed approach as well as the benefits of determining the scale origin in understanding value judgments.

13.

Sampling theory in item analysis

Walter W. Merrill Jr. 《Psychometrika》1937,2(4):215-223

Since item values obtained by item analysis procedures are not always stable from one situation to another, it follows that selection of items for validity or difficulty is sometimes useless. An application of Chi Square to testing homogeneity of item values is made, in the case of the UL method, and illustrative data are presented. A method of applying sampling theory to Horst's maximizing function is outlined, illustrating the author's observation that the results of item analysis by any of various methods may be similarly tested.
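The idea of testing whether an item's values are homogeneous across samples can be illustrated with a generic Pearson chi-square statistic for a contingency table. The sketch below is not the article's exact procedure for the UL method; the function name and the counts are made up for illustration.

```python
def chi_square_homogeneity(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    Each row is one sample; each column is a response category
    (for example, pass and fail counts for one item).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if all samples shared the same proportions.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical pass/fail counts for one item in two samples of 50 examinees.
stat, dof = chi_square_homogeneity([[30, 20], [20, 30]])
```

The statistic is then compared with a chi-square distribution with `dof` degrees of freedom; a large value suggests the item does not behave the same way in the two samples.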

14.

A diagnostic tree model for polytomous responses with multiple strategies

Wenchao Ma 《The British journal of mathematical and statistical psychology》2019,72(1):61-82

Constructed-response items have been shown to be appropriate for cognitively diagnostic assessments because students’ problem-solving procedures can be observed, providing direct evidence for making inferences about their proficiency. However, multiple strategies used by students make item scoring and psychometric analyses challenging. This study introduces the so-called two-digit scoring scheme into diagnostic assessments to record both students’ partial credits and their strategies. This study also proposes a diagnostic tree model (DTM) by integrating the cognitive diagnosis models with the tree model to analyse the items scored using the two-digit rubrics. Both convergent and divergent tree structures are considered to accommodate various scoring rules. The MMLE/EM algorithm is used for item parameter estimation of the DTM, and has been shown to provide good parameter recovery under varied conditions in a simulation study. A set of data from TIMSS 2007 mathematics assessment is analysed to illustrate the use of the two-digit scoring scheme and the DTM.

15.

A Multilevel Nonlinear Profile Analysis Model for Dichotomous Data

Steven Andrew Culpepper 《Multivariate behavioral research》2013,48(5):646-667

This study linked nonlinear profile analysis (NPA) of dichotomous responses with an existing family of item response theory models and generalized latent variable models (GLVM). The NPA method offers several benefits over previous internal profile analysis methods: (a) NPA is estimated with maximum likelihood in a GLVM framework rather than relying on the choice of different dissimilarity measures that produce different results, (b) item and person parameters are computed during the same estimation step with an appropriate distribution for dichotomous variables, (c) the model estimates profile coordinate standard errors, and (d) additional individual-level variables can be included to model relationships with the profile parameters. An application examined experimental differences in topographic map comprehension among 288 subjects. The model produced a measure of overall test performance or comprehension in addition to pattern variables that measured the correspondence between subject response profiles and an item difficulty profile and an item-discrimination profile. The findings suggested that subjects who used 3-dimensional maps tended to correctly answer more items in addition to correctly answering items that were more discriminating indicators of map comprehension. The NPA analysis was also compared with results from a multidimensional item response theory model.

16.

Sufficiency and Conditional Estimation of Person Parameters in the Polytomous Rasch Model

David Andrich 《Psychometrika》2010,75(2):292-308

Rasch models are characterised by sufficient statistics for all parameters. In the Rasch unidimensional model for two ordered categories, the parameterisation of the person and item is symmetrical and it is readily established that the total scores of a person and item are sufficient statistics for their respective parameters. In contrast, in the unidimensional polytomous Rasch model for more than two ordered categories, the parameterisation is not symmetrical. Specifically, each item has a vector of item parameters, one for each category, and each person only one person parameter. In addition, different items can have different numbers of categories and, therefore, different numbers of parameters. The sufficient statistic for the parameters of an item is itself a vector. In estimating the person parameters in presently available software, these sufficient statistics are not used to condition out the item parameters. This paper derives a conditional, pairwise, pseudo-likelihood and constructs estimates of the parameters of any number of persons which are independent of all item parameters and of the maximum scores of all items. It also shows that these estimates are consistent. Although Rasch’s original work began with equating tests using test scores, and not with items of a test, the polytomous Rasch model has not been applied in this way. Operationally, this is because the current approaches, in which item parameters are estimated first, cannot handle test data where there may be many scores with zero frequencies. A small simulation study shows that, when using the estimation equations derived in this paper, such a property of the data is no impediment to the application of the model at the level of tests. This opens up the possibility of using the polytomous Rasch model directly in equating test scores.

17.

Optimal Item Calibration for Computerized Achievement Tests

Ul Hassan Mahmood Miller Frank 《Psychometrika》2019,84(4):1101-1128

Item calibration is a technique to estimate characteristics of questions (called items) for achievement tests. In computerized tests, item calibration is an important tool for maintaining, updating and developing new items for an item bank. To efficiently sample examinees with specific ability levels for this calibration, we use optimal design theory assuming that the probability to answer correctly follows an item response model. Locally optimal unrestricted designs usually have a few design points for ability. In practice, it is hard to sample examinees from a population with these specific ability levels due to unavailability or limited availability of examinees. To counter this problem, we use the concept of optimal restricted designs and show that this concept naturally fits item calibration. We prove an equivalence theorem needed to verify optimality of a design. Locally optimal restricted designs provide intervals of ability levels for optimal calibration of an item. When assuming a two-parameter logistic model, several scenarios with D-optimal restricted designs are presented for calibration of a single item and simultaneous calibration of several items. These scenarios show that the naive way to sample examinees around unrestricted design points is not optimal.
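To make the notion of a locally D-optimal unrestricted design concrete, the sketch below evaluates the determinant of the Fisher information matrix for a single two-parameter logistic item under several symmetric two-point designs. It covers only the unrestricted case, not the restricted designs developed in the article; the function names, the assumed item parameters, and the candidate design points are illustrative assumptions. The value 1.5434 is the standard D-optimal spacing on the logit scale for a two-parameter logistic regression, corresponding to success probabilities of roughly 0.18 and 0.82.

```python
import math

def info_matrix_2pl(design, a, b):
    """Normalized Fisher information for (a, b) from (theta, weight) pairs."""
    m_aa = m_ab = m_bb = 0.0
    for theta, weight in design:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        v = weight * p * (1.0 - p)
        d_a = theta - b  # derivative of a * (theta - b) with respect to a
        d_b = -a         # derivative of a * (theta - b) with respect to b
        m_aa += v * d_a * d_a
        m_ab += v * d_a * d_b
        m_bb += v * d_b * d_b
    return m_aa, m_ab, m_bb

def d_criterion(design, a, b):
    """Determinant of the 2 x 2 information matrix (larger is better)."""
    m_aa, m_ab, m_bb = info_matrix_2pl(design, a, b)
    return m_aa * m_bb - m_ab ** 2

def symmetric_design(c, a, b):
    """Two equally weighted abilities at logit distance c on either side of b."""
    return [(b - c / a, 0.5), (b + c / a, 0.5)]

# Assumed item parameters; "locally" optimal means optimal for these guesses.
a, b = 1.2, 0.3
criteria = {c: d_criterion(symmetric_design(c, a, b), a, b)
            for c in (1.0, 1.5434, 2.0)}
```

Among the three candidates, the design at 1.5434 gives the largest determinant, which shows why an unrestricted optimum concentrates on very few ability levels that may be hard to sample in practice.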


18.

A Comparison of Using the Fixed Common-Precalibrated Parameter Method and the Matched Characteristic Curve Method for Linking Multiple-Test Items

《International Journal of Testing》2013,13(3):267-293

A linking design typically consists of a data collection procedure together with an item linking procedure that places item parameters calibrated from multiple test forms onto a common scale. This study considered 2 potentially useful item response theory linking designs. The first one is characterized by selecting a single set of common items across all multiple test forms, the precalibrated item parameters of which are kept fixed while the unknown parameters of the other items are being estimated. This linking design will be referred to as the fixed common-precalibrated item parameter design. However, data collected under this design could also be analyzed by the characteristic curve method, which constituted an alternative linking procedure. In this study, the relative merits of the 2 linking designs were examined with respect to their robustness against 3 manipulated conditions, namely, when the common items have imprecise estimates, when there is a noticeable difference in the average item difficulty between the common and the noncommon items, and when the examinees are heterogeneous in terms of their abilities. A parameter recovery study was conducted to achieve this purpose. The results indicated that both linking designs were capable of producing accurate linking of items and equivalent estimation of ability parameters under the 3 conditions. When the 2 designs were actually utilized in the development of an item bank, it was found that both linking designs produced quite consistent solutions despite minor differences on some item and ability estimates. Conditions under which one linking design is preferred over the other are also provided in the Discussion section of this article.

19.

Rasch models for item bundles

Mark Wilson Raymond J. Adams 《Psychometrika》1995,60(2):181-198

This paper discusses the application of a class of Rasch models to situations where test items are grouped into subsets and the common attributes of items within these subsets bring into question the usual assumption of conditional independence. The models are all expressed as particular cases of the random coefficients multinomial logit model developed by Adams and Wilson. This formulation allows a very flexible approach to the specification of alternative models, and makes model testing particularly straightforward. The use of the models is illustrated using item bundles constructed in the framework of the SOLO taxonomy of Biggs and Collis. The work of both authors was supported by fellowships from the National Academy of Education Spencer Fellowship.

20.

An autoregressive growth model for longitudinal item analysis

Minjeong Jeon Sophia Rabe-Hesketh 《Psychometrika》2016,81(3):830-850

A first-order autoregressive growth model is proposed for longitudinal binary item analysis where responses to the same items are conditionally dependent across time given the latent traits. Specifically, the item response probability for a given item at a given time depends on the latent trait as well as the response to the same item at the previous time, or the lagged response. An initial conditions problem arises because there is no lagged response at the initial time period. We handle this problem by adapting solutions proposed for dynamic models in panel data econometrics. Asymptotic and finite sample power for the autoregressive parameters are investigated. The consequences of ignoring local dependence and the initial conditions problem are also examined for data simulated from a first-order autoregressive growth model. The proposed methods are applied to longitudinal data on Korean students’ self-esteem.
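The dependence on the lagged response can be sketched with a simple logistic response function in which the previous answer to the same item shifts the logit. This is a simplified Rasch-type illustration, not the model specification from the article; the function name, the parameter names, and the values are hypothetical, and the initial conditions problem is not addressed here.

```python
import math

def response_prob(theta, difficulty, lag_effect, previous_response):
    """Probability of a positive response given the latent trait and the lagged response.

    previous_response is 0 or 1; lag_effect is the (hypothetical)
    autoregressive parameter on the logit scale.
    """
    logit = theta - difficulty + lag_effect * previous_response
    return 1.0 / (1.0 + math.exp(-logit))

# Same person and item, differing only in the answer given at the previous wave.
p_after_zero = response_prob(theta=0.5, difficulty=0.0, lag_effect=0.8,
                             previous_response=0)
p_after_one = response_prob(theta=0.5, difficulty=0.0, lag_effect=0.8,
                            previous_response=1)
```

With a positive `lag_effect`, a positive response at the previous time raises the probability at the current time even when the latent trait is held fixed, which is the local dependence the abstract describes.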