Similar literature
 20 similar documents found.
1.
A new algorithm for obtaining exact person fit indexes for the Rasch model is introduced which realizes most powerful tests for a very general family of alternative hypotheses, including tests concerning DIF as well as model-deviating item correlations. The method is also used as a goodness-of-fit test for whole data sets where the item parameters are assumed to be known. For tests with at most 30 items, exact values are obtained; for longer tests, a Monte Carlo algorithm is proposed. Simulated examples and an empirical investigation demonstrate test power and applicability to item elimination. The author wishes to thank Elisabeth Ponocny-Seliger and the reviewers for many helpful comments. All exact goodness-of-fit tests proposed in this article are implemented in the menu-driven program T-Rasch 1.0 by Ponocny and Ponocny-Seliger (1999), which can be obtained from ProGAMMA (WWW: http://www.gamma.rug.nl) and also performs nonparametric tests.

2.
Changes in dichotomous data caused by treatments can be analyzed by means of the so-called linear logistic model with relaxed assumptions (LLRA). The LLRA does not require observable criteria representing a single underlying latent trait, but it postulates the generalizability of the treatment effects over criteria and subjects. To test this latter crucial assumption, the mixture LLRA was proposed, which allows directly unobservable types of subjects to have different treatment effects. As the earlier methods for estimating the parameters of the mixture LLRA have specific drawbacks, a further method based on the conditional maximum likelihood principle is presented here. In contrast to the earlier conditional methods, it uses all of the dichotomous change data while having fewer parameters. Further, its goodness-of-fit tests become more sensitive to a falsely specified number of change-types even though the treatment effects are biased. For typically occurring small to moderate sample sizes, however, parametric bootstrapping of the distributions of the fit statistics is recommended for performing hypothesis tests. Finally, three applications of the new method to empirical data are described: first, the effect of the so-called Trager psychophysical integration; second, the effect of autogenic therapy on patients with psychosomatic symptoms; and third, the effect of religious education on the attitude towards sects. The mixture LLRA is implemented in the menu-driven program MIXLLRA, which can be obtained from Ivo Ponocny via e-mail (ivo.ponocny@univie.ac.at).

3.
A set of linear conditions on item response functions is derived that guarantees identical observed-score distributions on two test forms. The conditions can be added as constraints to a linear programming model for test assembly that assembles a new test form to have an observed-score distribution optimally equated to the distribution on an old form. For a well-designed item pool and items fitting the IRT model, use of the model results in observed-score pre-equating and removes the need for post hoc equating by a conventional observed-score equating method. An empirical example illustrates the use of the model for an item pool from the Law School Admission Test. The authors are most indebted to Norman D. Verhelst for suggesting Proposition 4 and its proof, to the Law School Admission Council (LSAC) for making available the data set, and to Wim M. M. Tielen for his computational assistance.

4.
For item responses fitting the Rasch model, the assumptions underlying the Mokken model of double monotonicity are met. This makes non-parametric item response theory a natural starting-point for Rasch item analysis. This paper studies scalability coefficients based on Loevinger's H coefficient that summarizes the number of Guttman errors in the data matrix. These coefficients are shown to yield efficient tests of the Rasch model using p-values computed using Markov chain Monte Carlo methods. The power of the tests of unequal item discrimination, and their ability to distinguish between local dependence and unequal item discrimination, are discussed. The methods are illustrated and motivated using a simulation study and a real data example.
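
As a rough illustration of the coefficient this abstract builds on, the sketch below computes a scale-level Loevinger's H from a dichotomous data matrix as one minus the ratio of observed Guttman errors to the number expected under item independence. The function name `loevinger_h` and the list-of-rows input format are assumptions made for this example; the paper's own coefficients and its MCMC p-values are not reproduced here.

```python
from itertools import combinations


def loevinger_h(data):
    """Scale-level Loevinger's H for a list of equally long 0/1 response rows.

    Hypothetical helper for illustration; raises ZeroDivisionError if no
    Guttman errors are possible (e.g. an item everyone passes or fails).
    """
    n = len(data)
    k = len(data[0])
    # Item popularities: proportion of 1-responses per item.
    p = [sum(row[j] for row in data) / n for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i, j in combinations(range(k), 2):
        # Order the pair so that `hard` is the less popular item.
        hard, easy = (i, j) if p[i] <= p[j] else (j, i)
        # Guttman error: the harder item is passed while the easier one is failed.
        observed += sum(1 for row in data if row[hard] == 1 and row[easy] == 0)
        # Expected number of such errors if the two items were independent.
        expected += n * p[hard] * (1 - p[easy])
    return 1.0 - observed / expected
```

A perfect Guttman scale gives H = 1, and a matrix in which two items are exactly independent gives H = 0.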

5.
Score tests for identifying locally dependent item pairs have been proposed for binary item response models. In this article, both the bifactor and the threshold shift score tests are generalized to the graded response model. For the bifactor test, the generalization is straightforward; it adds one secondary dimension associated only with one pair of items. For the threshold shift test, however, multiple generalizations are possible: in particular, conditional, uniform, and linear shift tests are discussed in this article. Simulation studies show that all of the score tests have accurate Type I error rates given large enough samples, although their small-sample behaviour is not as good as that of Pearson's Χ² and M2 as proposed in other studies for the purpose of local dependence (LD) detection. All score tests have the highest power to detect the LD which is consistent with their parametric form, and in this case they are uniformly more powerful than Χ² and M2; even wrongly specified score tests are more powerful than Χ² and M2 in most conditions. An example using empirical data is provided for illustration.

6.
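The traditional scoring of forced-choice tests produces ipsative data, which cannot be subjected to conventional reliability and validity testing, factor analysis, analysis of variance, and similar procedures. In recent years, researchers have proposed scoring models based on item response theory, such as the Thurstonian IRT model and the MUPP model, which avoid the drawbacks of ipsative data. The Thurstonian IRT model is convenient for parameter estimation and flexible in its specification, whereas the MUPP model is less extensible and its parameter estimation methods need improvement. On the other hand, researchers have already developed some faking-resistant forced-choice tests based on the MUPP model, while the Thurstonian IRT model is still far from such applications. In addition, the applicability and validity of both models await further empirical testing.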

7.
The predictive validity of a psychological measure can be improved by minimizing measurement errors through increases in the length of the assessment (aggregation) and, for an assessment of finite length, by making use of objective strategies for choosing from all available component measures. Two prominent considerations in selecting individual measures to be aggregated involve standards of (a) item content (construct approach) and (b) item/criterion association (empirical approach). Personality trait scales of different lengths were assembled for this study in order to represent features of the construct and empirical methods of selection. It was observed that (a) although reliability and validity generally increased with test length, aggregation beyond a certain point can fail to be expedient; and (b) although the prediction performance of empirically derived measures initially surpassed that of construct-based assessments, the superiority of the empirical scales did not generalize to trait criteria that were not used as a basis for item selection. The data are interpreted as providing support for a theory-based program of test development where substantive considerations involving item content play a major role. The findings are also viewed as encouragement for conventional conceptualizations about organized dimensions of behavior.

8.
Measurement invariance is a fundamental assumption in item response theory models, where the relationship between a latent construct (ability) and observed item responses is of interest. Violation of this assumption would render the scale misinterpreted or cause systematic bias against certain groups of persons. While a number of methods have been proposed to detect measurement invariance violations, they typically require advance definition of problematic item parameters and respondent grouping information. However, these pieces of information are typically unknown in practice. As an alternative, this paper focuses on a family of recently proposed tests based on stochastic processes of casewise derivatives of the likelihood function (i.e., scores). These score-based tests only require estimation of the null model (when measurement invariance is assumed to hold), and they have been previously applied in factor-analytic, continuous data contexts as well as in models of the Rasch family. In this paper, we aim to extend these tests to two-parameter item response models, with strong emphasis on pairwise maximum likelihood. The tests’ theoretical background and implementation are detailed, and the tests’ abilities to identify problematic item parameters are studied via simulation. An empirical example illustrating the tests’ use in practice is also provided.

9.
Two types of global testing procedures for item fit to the Rasch model were evaluated using simulation studies. The first type incorporates three tests based on first-order statistics: van den Wollenberg's Q1 test, Glas's R1 test, and Andersen's LR test. The second type incorporates three tests based on second-order statistics: van den Wollenberg's Q2 test, Glas's R2 test, and a non-parametric test proposed by Ponocny. The Type I error rates and the power against the violation of parallel item response curves, unidimensionality and local independence were analysed in relation to sample size and test length. In general, the outcomes indicate a satisfactory performance of all tests, except the Q2 test, which exhibits an inflated Type I error rate. Further, it was found that both types of tests have power against all three types of model violation. A possible explanation is the interdependencies among the assumptions underlying the model.

10.
The sampling properties of four item discrimination indices (biserial r, Cook's index B, the U–L 27 per cent index, and Delta P) were investigated for small samples drawn from actual test data rather than constructed data. The empirical results indicated that the mean index values approximated the population values and that values of the standard deviations computed from large sample formulas were good approximations to the standard deviations of the observed distributions based on samples of size 120 or less. Goodness of fit tests comparing the observed distributions with the corresponding distribution of the product-moment correlation coefficient based upon a bivariate normal population indicated that this correlational model was inappropriate for the data. The lack of adequate mathematical models for the sampling distributions of item discrimination indices suggests that one should avoid indices whose only real reason for existence was the simplification of computational procedures. The research reported herein was performed pursuant to a contract (OE-2-10-071) with the United States Office of Education, Department of Health, Education and Welfare.

11.
A real-data simulation of computerized adaptive testing (CAT) is an important step in real-life CAT applications. Such a simulation allows CAT developers to evaluate important features of the CAT system, such as item selection and stopping rules, before live testing. SIMPOLYCAT, a SAS macro program, was created by the authors to conduct real-data CAT simulations based on polytomous item response theory (IRT) models. In SIMPOLYCAT, item responses can be input from an external file or generated internally on the basis of item parameters provided by users. The program allows users to choose among methods of setting initial θ, approaches to item selection, trait estimators, CAT stopping criteria, polytomous IRT models, and other CAT parameters. In addition, CAT simulation results can be saved easily and used for further study. The purpose of this article is to introduce SIMPOLYCAT, briefly describe the program algorithm and parameters, and provide examples of CAT simulations, using generated and real data. Visual comparisons of the results obtained from the CAT simulations are presented.

12.
Even though many educational and psychological tests are known to be multidimensional, little research has been done to address how to measure individual differences in change within an item response theory framework. In this paper, we suggest a generalized explanatory longitudinal item response model to measure individual differences in change. New longitudinal models for multidimensional tests and existing models for unidimensional tests are presented within this framework and implemented with software developed for generalized linear models. In addition to the measurement of change, the longitudinal models we present can also be used to explain individual differences in change scores for person groups (e.g., learning disabled students versus non-learning disabled students) and to model differences in item difficulties across item groups (e.g., number operation, measurement, and representation item groups in a mathematics test). An empirical example illustrates the use of the various models for measuring individual differences in change when there are person groups and multiple skill domains which lead to multidimensionality at a time point.

13.
This paper uses an extension of the network algorithm originally introduced by Mehta and Patel to construct exact tail probabilities for testing the general hypothesis that item responses are distributed according to the Rasch model. By assuming that item difficulties are known, the algorithm is applicable to the statistical tests either given the maximum likelihood ability estimate or conditioned on the total score. A simulation study indicates that the network algorithm is an efficient tool for computing the significance level of a person fit statistic based on test lengths of 30 items or fewer.
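
To make the idea of an exact tail probability conditioned on the total score concrete, here is a minimal sketch that uses brute-force enumeration in place of the network algorithm described above. It assumes known item difficulties and uses the conditional probability of the response pattern as the person fit statistic; that choice of statistic, the function name `exact_person_fit_p`, and the tie tolerance are assumptions for illustration only.

```python
from itertools import combinations
from math import exp


def exact_person_fit_p(pattern, difficulties):
    """Exact p-value of a 0/1 response pattern under the Rasch model,
    conditional on its total score, with item difficulties treated as known.

    Brute-force enumeration (not the network algorithm): feasible only
    for short tests, since all patterns with the same score are visited.
    """
    k = len(pattern)
    r = sum(pattern)
    eps = [exp(-b) for b in difficulties]

    def weight(ones):
        # Unnormalized conditional probability: product of exp(-b_i) over solved items.
        w = 1.0
        for i in ones:
            w *= eps[i]
        return w

    observed = weight([i for i in range(k) if pattern[i] == 1])
    total = 0.0  # normalizing constant (elementary symmetric function of order r)
    tail = 0.0   # mass of patterns at most as likely as the observed one
    for ones in combinations(range(k), r):
        w = weight(ones)
        total += w
        if w <= observed * (1 + 1e-12):  # small tolerance so ties count as "as extreme"
            tail += w
    return tail / total
```

With difficulties `[-1, 0, 1]`, solving only the easiest item is the most likely pattern for a score of 1 and yields a p-value of 1, whereas solving only the hardest item yields roughly 0.09.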

14.
This research is concerned with two topics in assessing model fit for categorical data analysis. The first topic involves the application of a limited-information overall test, introduced in the item response theory literature, to structural equation modeling (SEM) of categorical outcome variables. Most popular SEM test statistics assess how well the model reproduces estimated polychoric correlations. In contrast, limited-information test statistics assess how well the underlying categorical data are reproduced. Here, the recently introduced C2 statistic of Cai and Monroe (2014) is applied. The second topic concerns how the root mean square error of approximation (RMSEA) fit index can be affected by the number of categories in the outcome variable. This relationship creates challenges for interpreting RMSEA. While the two topics initially appear unrelated, they may conveniently be studied in tandem since RMSEA is based on an overall test statistic, such as C2. The results are illustrated with an empirical application to data from a large-scale educational survey.
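
Since the abstract notes that RMSEA is derived from an overall test statistic, a small sketch of the commonly used point estimate may help. It assumes the textbook form sqrt(max(0, (T - df) / (df (N - 1)))); some software divides by N instead of N - 1, and the abstract does not say which variant the authors use. The function name `rmsea` is hypothetical.

```python
from math import sqrt


def rmsea(statistic, df, n):
    """RMSEA point estimate from a chi-square-type test statistic.

    Assumes the (N - 1) convention; this is an illustrative choice,
    not necessarily the one used in the study.
    """
    return sqrt(max(0.0, (statistic - df) / (df * (n - 1))))
```

A statistic below its degrees of freedom is truncated to an RMSEA of zero.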

15.
The four-parameter logistic (4PL) item response model, which includes an upper asymptote for the correct response probability, has drawn increasing interest due to its suitability for many practical scenarios. This paper proposes a new Gibbs sampling algorithm for estimation of the multidimensional 4PL model based on an efficient data augmentation scheme (DAGS). With the introduction of three continuous latent variables, the full conditional distributions are tractable, allowing easy implementation of a Gibbs sampler. Simulation studies are conducted to evaluate the proposed method and several popular alternatives. An empirical data set was analysed using the 4PL model to show its improved performance over the three-parameter and two-parameter logistic models. The proposed estimation scheme is easily accessible to practitioners through the open-source IRTlogit package.
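
For readers unfamiliar with the model, the sketch below evaluates the usual unidimensional 4PL response function, with discrimination `a`, difficulty `b`, lower asymptote `c`, and upper asymptote `d`. The parameter names and the unidimensional form are assumptions for illustration; the paper itself treats the multidimensional model and its Gibbs sampler, neither of which is shown here.

```python
from math import exp


def p_4pl(theta, a, b, c, d):
    """Probability of a correct response under a unidimensional 4PL model.

    c is the lower asymptote (guessing) and d the upper asymptote (slipping);
    setting c = 0 and d = 1 recovers the two-parameter logistic model.
    """
    return c + (d - c) / (1.0 + exp(-a * (theta - b)))
```

At `theta == b` the probability sits midway between the two asymptotes, for example 0.55 when `c = 0.2` and `d = 0.9`.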

16.
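Cognitive diagnosis models (CDMs) with polytomous attributes provide more detailed diagnostic feedback than traditional CDMs with dichotomous attributes, but most existing polytomous-attribute CDMs cannot directly analyze polytomously (or mixed) scored data. Building on the graded response model, this paper extends the reparameterized polytomous-attribute DINA model to polytomous scoring, yielding a graded response polytomous-attribute DINA model that can handle polytomously scored data. An empirical data analysis first demonstrates the practical applicability of the new model; a simulation study then examines its parameter recovery. The results show that the new model meets the practical need to handle polytomous attributes and polytomously scored data simultaneously and has good psychometric properties, although it places some demands on test quality (e.g., high item quality and a complete Qp matrix).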

17.
For detecting differential item functioning (DIF) between two or more groups of test takers in the Rasch model, their item parameters need to be placed on the same scale. Typically this is done by means of choosing a set of so-called anchor items based on statistical tests or heuristics. Here the authors suggest an alternative strategy: By means of an inequality criterion from economics, the Gini Index, the item parameters are shifted to an optimal position where the item parameter estimates of the groups best overlap. Several toy examples, extensive simulation studies, and two empirical application examples are presented to illustrate the properties of the Gini Index as an anchor point selection criterion and compare its properties to those of the criterion used in the alignment approach of Asparouhov and Muthén. In particular, the authors show that, in addition to the globally optimal position for the anchor point, the criterion plot contains valuable additional information and may help discover unaccounted DIF-inducing multidimensionality. They further provide mathematical results that enable an efficient sparse grid optimization and make it feasible to extend the approach, for example, to multiple group scenarios.

18.
A new observable consequence of the property of invariant item ordering is presented, which holds under Mokken’s double monotonicity model for dichotomous data. The observable consequence is an invariant ordering of the item-total regressions. Kendall’s measure of concordance W and a weighted version of this measure are proposed as measures for this property. Karabatsos and Sheu proposed a Bayesian procedure (Appl. Psychol. Meas. 28:110–125, 2004), which can be used to determine whether the property of an invariant ordering of the item-total regressions should be rejected for a set of items. An example is presented to illustrate the application of the procedures to empirical data.
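
The unweighted concordance measure mentioned above can be sketched as follows. This is the standard Kendall's W without tie correction, computed from a list of complete rankings; how the rankings are derived from item-total regressions (for instance, one ranking of the items per total-score group) is an assumption about the setting, not something the abstract specifies, and the weighted version is not shown.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance W (no tie correction).

    `rankings` is a list of m rankings, each assigning ranks 1..n to the
    same n objects. Hypothetical helper for illustration.
    """
    m = len(rankings)
    n = len(rankings[0])
    # Sum of the ranks each object receives across all rankings.
    rank_sums = [sum(r[j] for r in rankings) for j in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    # Sum of squared deviations of the rank sums from their mean.
    s = sum((t - mean_rank_sum) ** 2 for t in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

Identical rankings give W = 1, while two exactly reversed rankings give W = 0.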

19.
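Tu Dongbo, Cai Yan, Dai Haiqi, Ding Shuliang. Acta Psychologica Sinica (《心理学报》), 2011, 43(11): 1329–1340
This study introduces a frontier technique in modern measurement theory, multidimensional item response theory (MIRT), implements its parameter estimation with an MCMC algorithm, and applies MIRT to Raven's Advanced Progressive Matrices to explore its concrete use in psychological testing. The results show that: (1) the MIRT parameter estimation program developed in this study is essentially workable, with estimation accuracy comparable to or even better than that reported in studies abroad. (2) Under a completely randomized two-factor design crossing test dimensionality and sample size (2×3), the accuracy and stability of the MIRT parameter estimates increase as the numbers of examinees and items increase, but both decrease as the number of test dimensions increases. (3) MIRT provides more precise and detailed information for the analysis of psychological tests than unidimensional IRT (UIRT). It offers important guidance for the construction, development, and evaluation of psychological tests and is worth introducing and drawing on.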

20.
Autocorrelation and partial autocorrelation, which provide a mathematical tool to understand repeating patterns in time series data, are often used to facilitate the identification of model orders of time series models (e.g., moving average and autoregressive models). Asymptotic methods for testing autocorrelation and partial autocorrelation, such as the 1/T approximation method and Bartlett's formula method, may fail in finite samples and are vulnerable to non-normality. Resampling techniques such as the moving block bootstrap and the surrogate data method are competitive alternatives. In this study, we use a Monte Carlo simulation study and a real data example to compare asymptotic methods with the aforementioned resampling techniques. For each resampling technique, we consider both the percentile method and the bias-corrected and accelerated method for interval construction. Simulation results show that the surrogate data method with percentile intervals yields better performance than the other methods. An R package pautocorr is used to carry out tests evaluated in this study.
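
One of the resampling techniques compared above, the moving block bootstrap with a percentile interval, can be sketched in a few lines. The function names, the default block length, and the simple order-statistic rule for the percentile bounds are assumptions for illustration; this does not reproduce the pautocorr package, the surrogate data method, or the bias-corrected and accelerated interval.

```python
import random


def autocorr(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return num / denom  # undefined (ZeroDivisionError) for a constant series


def moving_block_bootstrap_ci(x, lag=1, block_len=5, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for the lag-`lag` autocorrelation of x.

    Overlapping blocks of length `block_len` are drawn with replacement and
    concatenated until the original length is reached. Illustrative sketch.
    """
    rng = random.Random(seed)
    n = len(x)
    starts = range(n - block_len + 1)   # admissible block start positions
    n_blocks = -(-n // block_len)       # ceiling division
    stats = []
    for _ in range(n_boot):
        series = []
        for _ in range(n_blocks):
            s = rng.choice(starts)
            series.extend(x[s:s + block_len])
        stats.append(autocorr(series[:n], lag))  # trim to the original length
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

A call such as `moving_block_bootstrap_ci(series, lag=1, block_len=5)` returns a `(lower, upper)` pair; in practice the block length would need to be tuned to the dependence structure of the series.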
