20 similar documents found (search time: 15 ms)
1.
Limited‐information goodness‐of‐fit testing of diagnostic classification item response models
Mark Hansen Li Cai Scott Monroe Zhen Li 《The British journal of mathematical and statistical psychology》2016,69(3):225-252
Despite the growing popularity of diagnostic classification models (e.g., Rupp et al., 2010, Diagnostic measurement: theory, methods, and applications, Guilford Press, New York, NY) in educational and psychological measurement, methods for testing their absolute goodness of fit to real data remain relatively underdeveloped. For tests of reasonable length and for realistic sample size, full‐information test statistics such as Pearson's X2 and the likelihood ratio statistic G2 suffer from sparseness in the underlying contingency table from which they are computed. Recently, limited‐information fit statistics such as Maydeu‐Olivares and Joe's (2006, Psychometrika, 71, 713) M2 have been found to be quite useful in testing the overall goodness of fit of item response theory models. In this study, we applied Maydeu‐Olivares and Joe's (2006, Psychometrika, 71, 713) M2 statistic to diagnostic classification models. Through a series of simulation studies, we found that M2 is well calibrated across a wide range of diagnostic model structures and was sensitive to certain misspecifications of the item model (e.g., fitting disjunctive models to data generated according to a conjunctive model), errors in the Q‐matrix (adding or omitting paths, omitting a latent variable), and violations of local item independence due to unmodelled testlet effects. On the other hand, M2 was largely insensitive to misspecifications in the distribution of higher‐order latent dimensions and to the specification of an extraneous attribute. To complement the analyses of the overall model goodness of fit using M2, we investigated the utility of the Chen and Thissen (1997, J. Educ. Behav. Stat., 22, 265) local dependence statistic for characterizing sources of misfit, an important aspect of model appraisal often overlooked in favour of overall statements. 
The local dependence statistic was found to be slightly conservative (with Type I error rates consistently below the nominal level) but still useful in pinpointing the sources of misfit. Patterns of local dependence arising due to specific model misspecifications are illustrated. Finally, we used the M2 and local dependence statistics to evaluate a diagnostic model fit to data from the Trends in Mathematics and Science Study, drawing upon analyses previously conducted by Lee et al. (2011, IJT, 11, 144).
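The sparseness problem that motivates limited‐information testing can be made concrete with a small simulation. The sketch below is illustrative only: the Rasch‐type generating model, item count, and sample size are assumptions, not the paper's design.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_obs = 10, 500

# Simulate binary responses from a simple one-parameter (Rasch-type) model.
theta = rng.normal(size=n_obs)
b = np.linspace(-1.5, 1.5, n_items)
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
X = (rng.uniform(size=p.shape) < p).astype(int)

# Full information: the contingency table has 2**n_items cells, so with
# n_obs = 500 almost every cell is empty -- the sparseness that breaks
# Pearson's X2 and the likelihood ratio G2.
n_cells = 2 ** n_items
n_patterns = len({tuple(row) for row in X})

# Limited information: the univariate and bivariate margins that M2 uses
# stay well filled regardless of test length.
uni = X.mean(axis=0)          # estimated P(X_j = 1)
biv = (X.T @ X) / n_obs       # estimated P(X_j = 1, X_k = 1)
```

With 10 items there are already 1,024 response patterns but at most 500 can be observed; the margins, by contrast, are each estimated from all 500 respondents.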
2.
Li Cai Mark Hansen 《The British journal of mathematical and statistical psychology》2013,66(2):245-276
In applications of item response theory, assessment of model fit is a critical issue. Recently, limited‐information goodness‐of‐fit testing has received increased attention in the psychometrics literature. In contrast to full‐information test statistics such as Pearson’s X2 or the likelihood ratio G2, these limited‐information tests utilize lower‐order marginal tables rather than the full contingency table. A notable example is Maydeu‐Olivares and colleagues’ M2 family of statistics based on univariate and bivariate margins. When the contingency table is sparse, tests based on M2 retain better Type I error rate control than the full‐information tests and can be more powerful. While in principle the M2 statistic can be extended to test hierarchical multidimensional item factor models (e.g., bifactor and testlet models), the computation is non‐trivial. To obtain M2, a researcher often has to obtain (many thousands of) marginal probabilities, derivatives, and weights. Each of these must be approximated with high‐dimensional numerical integration. We propose a dimension reduction method that can take advantage of the hierarchical factor structure so that the integrals can be approximated far more efficiently. We also propose a new test statistic that can be substantially better calibrated and more powerful than the original M2 statistic when the test is long and the items are polytomous. We use simulations to demonstrate the performance of our new methods and illustrate their effectiveness with applications to real data.
3.
《The British journal of mathematical and statistical psychology》2006,59(2):429-449
Assessing item fit for unidimensional item response theory models for dichotomous items has always been an issue of enormous interest, but there exists no unanimously agreed item fit diagnostic for these models, and hence there is room for further investigation of the area. This paper employs the posterior predictive model‐checking method, a popular Bayesian model‐checking tool, to examine item fit for the above‐mentioned models. An item fit plot, comparing the observed and predicted proportion‐correct scores of examinees with different raw scores, is suggested. This paper also suggests how to obtain posterior predictive p‐values (which are natural Bayesian p‐values) for the item fit statistics of Orlando and Thissen that summarize numerically the information in the above‐mentioned item fit plots. A number of simulation studies and a real data application demonstrate the effectiveness of the suggested item fit diagnostics. The suggested techniques seem to have adequate power and reasonable Type I error rate, and psychometricians will find them promising.
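The mechanics of a posterior predictive p‐value can be sketched in a few lines. This is a simplified illustration, not the paper's procedure: a real analysis would use MCMC output, whereas here the "posterior draws" are stand‐ins (true values plus noise), and the discrepancy is per‐item proportion correct rather than the Orlando–Thissen statistics.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_items, n_draws = 300, 5, 200

# Hypothetical "observed" data, simulated here from a Rasch-type model.
theta_true = rng.normal(size=n_obs)
b_true = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
p_true = 1 / (1 + np.exp(-(theta_true[:, None] - b_true)))
X_obs = (rng.uniform(size=p_true.shape) < p_true).astype(int)

# Stand-in posterior draws: true values plus sampling noise.  A real
# analysis would take these from an MCMC sampler.
theta_draws = theta_true[None, :] + rng.normal(0, 0.3, (n_draws, n_obs))
b_draws = b_true[None, :] + rng.normal(0, 0.1, (n_draws, n_items))

# Discrepancy T = per-item proportion correct; PPP = P(T_rep >= T_obs | data).
T_obs = X_obs.mean(axis=0)
T_rep = np.empty((n_draws, n_items))
for d in range(n_draws):
    p_rep = 1 / (1 + np.exp(-(theta_draws[d][:, None] - b_draws[d])))
    X_rep = (rng.uniform(size=p_rep.shape) < p_rep).astype(int)
    T_rep[d] = X_rep.mean(axis=0)
ppp = (T_rep >= T_obs).mean(axis=0)   # values near 0 or 1 flag item misfit
```

When the fitted model is adequate, the replicated discrepancies bracket the observed ones and the p‐values stay away from the extremes.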
4.
5.
Using SAS PROC NLMIXED to fit item response theory models (total citations: 1; self-citations: 0; citations by others: 1)
Researchers routinely construct tests or questionnaires containing a set of items that measure personality traits, cognitive abilities, political attitudes, and so forth. Typically, responses to these items are scored in discrete categories, such as points on a Likert scale or a choice out of several mutually exclusive alternatives. Item response theory (IRT) explains observed responses to items on a test (questionnaire) by a person’s unobserved trait, ability, or attitude. Although applications of IRT modeling have increased considerably because of its utility in developing and assessing measuring instruments, IRT modeling has not been fully integrated into the curriculum of colleges and universities, mainly because existing general purpose statistical packages do not provide built-in routines with which to perform IRT modeling. Recent advances in statistical theory and the incorporation of those advances into general purpose statistical software such as the Statistical Analysis System (SAS) allow researchers to analyze measurement data by using a class of models known as generalized linear mixed effects models (McCulloch & Searle, 2001), which include IRT models as special cases. The purpose of this article is to demonstrate the generality and flexibility of using SAS to estimate IRT model parameters. With real data examples, we illustrate the implementations of a variety of IRT models for dichotomous, polytomous, and nominal responses. Since SAS is widely available in educational institutions, it is hoped that this article will contribute to the spread of IRT modeling in quantitative courses.
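The core computation PROC NLMIXED performs for an IRT model is a marginal likelihood in which the normal latent trait is integrated out numerically. A Python sketch of that integral for a 2PL model, using Gauss–Hermite quadrature, may make the idea concrete; the parameter values and sample size below are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss  # probabilist's Hermite rule

# Quadrature nodes/weights for a standard normal latent trait.
nodes, weights = hermegauss(21)
weights = weights / weights.sum()

def marginal_loglik(a, b, X):
    """Sum over persons of log integral_theta prod_j P(x_ij | theta) phi(theta)."""
    # P(X_j = 1 | theta_q) at each quadrature node: shape (n_quad, n_items)
    p = 1 / (1 + np.exp(-a * (nodes[:, None] - b)))
    # likelihood of each observed response pattern at each node
    lik = np.prod(np.where(X[:, None, :] == 1, p, 1 - p), axis=2)  # (n_obs, n_quad)
    return float(np.log(lik @ weights).sum())

# Illustration: simulated data score higher under the generating parameters
# than under badly misspecified ones.
rng = np.random.default_rng(4)
a_true = np.array([1.0, 1.5, 0.8])
b_true = np.array([-0.5, 0.0, 0.5])
theta = rng.normal(size=500)
P = 1 / (1 + np.exp(-a_true * (theta[:, None] - b_true)))
X = (rng.uniform(size=P.shape) < P).astype(int)

ll_true = marginal_loglik(a_true, b_true, X)
ll_wrong = marginal_loglik(a_true, b_true + 3.0, X)
```

Maximizing this function over (a, b) is exactly the nonlinear mixed‐effects fit the article carries out in SAS; an optimizer such as BFGS would complete the sketch.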
6.
《The British journal of mathematical and statistical psychology》2003,56(2):271-288
In sparse tables for categorical data well‐known goodness‐of‐fit statistics are not chi‐square distributed. A consequence is that model selection becomes a problem. It has been suggested that a way out of this problem is the use of the parametric bootstrap. In this paper, the parametric bootstrap goodness‐of‐fit test is studied by means of an extensive simulation study; the Type I error rates and power of this test are studied under several conditions of sparseness. In the presence of sparseness, models were used that were likely to violate the regularity conditions. Besides bootstrapping the goodness‐of‐fit statistics usually used (full information statistics), corrected versions of these statistics and a limited information statistic are bootstrapped. These bootstrap tests were also compared to an asymptotic test using limited information. Results indicate that bootstrapping the usual statistics fails because these tests are too liberal, and that bootstrapping or asymptotically testing the limited information statistic works better with respect to Type I error and outperforms the other statistics by far in terms of statistical power. The properties of all tests are illustrated using categorical Markov models.
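The parametric bootstrap logic is simple to state in code: fit the model, simulate datasets from the fitted model, re‐fit and recompute the statistic on each, and compare. The toy example below (a one‐parameter binomial model for four multinomial cells, with G2) is an illustrative assumption, not the paper's categorical Markov models.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(2)
B, n = 500, 200

def binom_probs(p):
    """Cell probabilities under a Binomial(3, p) model for counts 0..3."""
    return np.array([comb(3, k) * p**k * (1 - p)**(3 - k) for k in range(4)])

def g2(counts, p_hat):
    """Likelihood ratio statistic, with empty cells dropped from the sum."""
    expected = counts.sum() * binom_probs(p_hat)
    mask = counts > 0
    return 2 * np.sum(counts[mask] * np.log(counts[mask] / expected[mask]))

# "Observed" counts (generated from the model here, so fit should be acceptable).
obs = rng.multinomial(n, binom_probs(0.4))
p_hat = (obs * np.arange(4)).sum() / (3 * n)      # MLE of the binomial parameter

stat_obs = g2(obs, p_hat)
stat_boot = np.empty(B)
for i in range(B):
    sim = rng.multinomial(n, binom_probs(p_hat))  # simulate under the fitted model
    p_sim = (sim * np.arange(4)).sum() / (3 * n)  # re-fit on each bootstrap sample
    stat_boot[i] = g2(sim, p_sim)
p_value = (stat_boot >= stat_obs).mean()
```

Re‐estimating the parameter on every bootstrap sample is what makes the reference distribution account for estimation, which the chi‐square approximation fails to do in sparse tables.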
7.
《The British journal of mathematical and statistical psychology》2004,57(1):73-96
Multinomial models are increasingly being used in psychology, and this use always requires estimating model parameters and testing goodness of fit with a composite null hypothesis. Goodness of fit is customarily tested with recourse to the asymptotic approximation to the distribution of the statistics. An assessment of the quality of this approximation requires a comparison with the exact distribution, but how to compute this exact distribution when parameters are estimated from the data appears never to have been defined precisely. The main goal of this paper is to compare two different approaches to defining this exact distribution. One of the approaches uses the marginal distribution and is, therefore, independent of the data; the other approach uses the conditional distribution of the statistics given the estimated parameters and, therefore, is data-dependent. We carried out a thorough study involving various parameter estimation methods and goodness‐of‐fit statistics, all of them members of the general class of power‐divergence measures. Included in the study were multinomial models with three to five cells and up to three parameters. Our results indicate that the asymptotic distribution is rarely a good approximation to the exact marginal distribution of the statistics, whereas it is a good approximation to the exact conditional distribution only when the vector of expected frequencies is interior to the sample space of the multinomial distribution.
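The power‐divergence family referred to here is the Cressie–Read class, which contains Pearson's X2 (λ = 1) and the likelihood ratio G2 (λ → 0) as special cases. A minimal sketch, with made‐up counts:

```python
import numpy as np

def power_divergence(observed, expected, lam):
    """Cressie-Read power-divergence statistic.
    lam = 1 gives Pearson's X^2; lam -> 0 gives the likelihood ratio G^2."""
    observed = np.asarray(observed, float)
    expected = np.asarray(expected, float)
    if abs(lam) < 1e-12:                      # limiting case: G^2
        mask = observed > 0
        return 2 * np.sum(observed[mask] * np.log(observed[mask] / expected[mask]))
    return (2 / (lam * (lam + 1))) * np.sum(
        observed * ((observed / expected) ** lam - 1))

obs = np.array([30, 50, 20])                  # illustrative counts
exp_ = np.array([33.0, 47.0, 20.0])           # expected under some fitted model
x2 = power_divergence(obs, exp_, 1.0)         # Pearson's X^2
g2_ = power_divergence(obs, exp_, 0.0)        # likelihood ratio G^2
```

All members share the same asymptotic chi‐square distribution, which is precisely the approximation whose exact counterpart the paper investigates.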
8.
《The British journal of mathematical and statistical psychology》2006,59(2):379-395
This paper proposes two unidimensional item response theory (IRT) models for analysing normative forced‐choice personality items. Both models are derived from a common theoretical framework and arise as a result of different assumptions regarding the mechanism of choice. The simplest mechanism gives rise to the one‐parameter normal‐ogive model. The second mechanism gives rise to a new IRT model, which is closely related to the Coombs–Zinnes probabilistic unfolding model. The second model is compared theoretically to the normal‐ogive model in terms of item characteristic curves and amount of item information. Next, procedures for estimating the respondent and the item parameters in the second model are described. Finally, both models are empirically compared by using two well‐known personality measures.
9.
A model‐based procedure for assessing the extent to which missing data can be ignored and handling non‐ignorable missing data is presented. The procedure is based on item response theory modelling. As an example, the approach is worked out in detail in conjunction with item response data modelled using the partial credit and generalized partial credit models. Simulation studies are carried out to assess the extent to which the bias caused by ignoring the missing‐data mechanism can be reduced. Finally, the feasibility of the procedure is demonstrated using data from a study to calibrate a medical disability scale.
10.
Non‐ignorable missingness item response theory models for choice effects in examinee‐selected items
Chen‐Wei Liu Wen‐Chung Wang 《The British journal of mathematical and statistical psychology》2017,70(3):499-524
Examinee‐selected item (ESI) design, in which examinees are required to respond to a fixed number of items in a given set, always yields incomplete data (i.e., when only the selected items are answered, data are missing for the others) that are likely non‐ignorable in likelihood inference. Standard item response theory (IRT) models become infeasible when ESI data are missing not at random (MNAR). To solve this problem, the authors propose a two‐dimensional IRT model that posits one unidimensional IRT model for observed data and another for nominal selection patterns. The two latent variables are assumed to follow a bivariate normal distribution. In this study, the mirt freeware package was adopted to estimate parameters. The authors conduct an experiment to demonstrate that ESI data are often non‐ignorable and to determine how to apply the new model to the data collected. Two follow‐up simulation studies are conducted to assess the parameter recovery of the new model and the consequences for parameter estimation of ignoring MNAR data. The results of the two simulation studies indicate good parameter recovery of the new model and poor parameter recovery when non‐ignorable missing data were mistakenly treated as ignorable.
11.
Various different item response theory (IRT) models can be used in educational and psychological measurement to analyze test data. One of the major drawbacks of these models is that efficient parameter estimation can only be achieved with very large data sets. Therefore, it is often worthwhile to search for designs of the test data that in some way will optimize the parameter estimates. The results from the statistical theory on optimal design can be applied for efficient estimation of the parameters. A major problem in finding an optimal design for IRT models is that the designs are only optimal for a given set of parameters, that is, they are locally optimal. Locally optimal designs can be constructed with a sequential design procedure. In this paper minimax designs are proposed for IRT models to overcome the problem of local optimality. Minimax designs are compared to sequentially constructed designs for the two-parameter logistic model and the results show that minimax designs can be nearly as efficient as sequentially constructed designs.
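The local‐optimality problem is easy to see from the item information function of the two‐parameter logistic model, I(θ) = a²P(θ)(1 − P(θ)), which peaks exactly at θ = b: the best design point depends on the very parameter one is trying to estimate. A minimal numeric check, with illustrative parameter values:

```python
import numpy as np

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
    p = 1 / (1 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

theta_grid = np.linspace(-4, 4, 801)
# The information is maximized at theta = b, so a design optimal for one
# (a, b) is suboptimal for another -- the "local optimality" the paper
# addresses with minimax designs.
peak = theta_grid[np.argmax(info_2pl(theta_grid, a=1.5, b=0.5))]
```

A minimax design instead maximizes the worst‐case efficiency over a plausible region of (a, b), trading a little efficiency at any single parameter value for robustness across the region.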
12.
13.
A note on monotonicity of item response functions for ordered polytomous item response theory models
Hyeon-Ah Kang Ya-Hui Su Hua-Hua Chang 《The British journal of mathematical and statistical psychology》2018,71(3):523-535
A monotone relationship between a true score (τ) and a latent trait level (θ) has been a key assumption for many psychometric applications. The monotonicity property in dichotomous response models is evident as a result of a transformation via a test characteristic curve. Monotonicity in polytomous models, in contrast, is not immediately obvious because item response functions are determined by a set of response category curves, which are conceivably non-monotonic in θ. The purpose of the present note is to demonstrate strict monotonicity in ordered polytomous item response models. Five models that are widely used in operational assessments are considered for proof: the generalized partial credit model (Muraki, 1992, Applied Psychological Measurement, 16, 159), the nominal model (Bock, 1972, Psychometrika, 37, 29), the partial credit model (Masters, 1982, Psychometrika, 47, 147), the rating scale model (Andrich, 1978, Psychometrika, 43, 561), and the graded response model (Samejima, 1972, A general model for free-response data (Psychometric Monograph no. 18). Psychometric Society, Richmond). The study asserts that the item response functions in these models strictly increase in θ and thus there exists strict monotonicity between τ and θ under certain specified conditions. This conclusion validates the practice of customarily using τ in place of θ in applied settings and provides theoretical grounds for one-to-one transformations between the two scales.
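The property can be checked numerically for one of the five models. The sketch below evaluates the generalized partial credit model's item response function E[X | θ] on a grid and confirms it increases; the slope and step parameters are illustrative assumptions, and a numeric check on a grid is of course no substitute for the note's proof.

```python
import numpy as np

def gpcm_probs(theta, a, b):
    """Generalized partial credit model category probabilities.
    b holds the step parameters b_1..b_m; category scores are 0..m."""
    theta = np.atleast_1d(theta)
    # cumulative sums of a*(theta - b_v), with the k = 0 term fixed at 0
    z = np.cumsum(a * (theta[:, None] - b[None, :]), axis=1)
    z = np.hstack([np.zeros((theta.size, 1)), z])
    ez = np.exp(z - z.max(axis=1, keepdims=True))   # stable softmax
    return ez / ez.sum(axis=1, keepdims=True)

theta = np.linspace(-4, 4, 401)
P = gpcm_probs(theta, a=1.2, b=np.array([-0.7, 0.3, 1.1]))
true_score = P @ np.arange(P.shape[1])   # item response function E[X | theta]
```

Individual category curves are non‐monotonic (the middle categories rise and then fall), yet their score‐weighted sum increases strictly in θ, which is exactly the point of the note.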
14.
Björn Andersson Hao Luo Kseniia Marcq 《The British journal of mathematical and statistical psychology》2022,75(2):395-410
Reliability of scores from psychological or educational assessments provides important information regarding the precision of measurement. The reliability of scores is however population dependent and may vary across groups. In item response theory, this population dependence can be attributed to differential item functioning or to differences in the latent distributions between groups and needs to be accounted for when estimating the reliability of scores for different groups. Here, we introduce group-specific and overall reliability coefficients for sum scores and maximum likelihood ability estimates defined by a multiple group item response theory model. We derive confidence intervals using asymptotic theory and evaluate the empirical properties of estimators and the confidence intervals in a simulation study. The results show that the estimators are largely unbiased and that the confidence intervals are accurate with moderately large sample sizes. We exemplify the approach with the Montreal Cognitive Assessment (MoCA) in two groups defined by education level and give recommendations for applied work.
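One common way to define model‐based sum‐score reliability, consistent with the abstract's description, is the ratio of true‐score variance to total variance under the IRT model, computed by quadrature over the group's latent distribution. The sketch below uses a 2PL with made‐up item parameters and a standard normal latent distribution; the paper's multiple‐group coefficients and their standard errors are more involved.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# rho = Var(E[S | theta]) / (Var(E[S | theta]) + E[Var(S | theta)]),
# with S the sum score over items; quadrature over a N(0, 1) latent trait.
nodes, weights = hermegauss(41)
weights = weights / weights.sum()

# Illustrative 2PL item parameters (five items).
a = np.array([1.0, 1.3, 0.8, 1.5, 1.1])
b = np.array([-1.0, -0.3, 0.2, 0.8, 1.4])
p = 1 / (1 + np.exp(-a * (nodes[:, None] - b)))     # (n_quad, n_items)

true_score = p.sum(axis=1)                 # E[S | theta]
error_var = (p * (1 - p)).sum(axis=1)      # Var(S | theta), by local independence
ts_var = weights @ true_score**2 - (weights @ true_score) ** 2
reliability = ts_var / (ts_var + weights @ error_var)
```

A group‐specific coefficient follows by replacing the N(0, 1) quadrature with that group's estimated latent mean and variance (shift and scale the nodes), which is how the population dependence the abstract describes enters the calculation.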
15.
Paul W. Holland 《Psychometrika》1990,55(4):577-601
Item response theory (IRT) models are now in common use for the analysis of dichotomous item responses. This paper examines the sampling theory foundations for statistical inference in these models. The discussion includes: some history on the stochastic subject versus the random sampling interpretations of the probability in IRT models; the relationship between three versions of maximum likelihood estimation for IRT models; estimating θ versus estimating θ-predictors; IRT models and loglinear models; the identifiability of IRT models; and the role of robustness and Bayesian statistics from the sampling theory perspective. A presidential address can serve many different functions. This one is a report of investigations I started at least ten years ago to understand what IRT was all about. It is a decidedly one-sided view, but I hope it stimulates controversy and further research. I have profited from discussions of this material with many people including: Brian Junker, Charles Lewis, Nicholas Longford, Robert Mislevy, Ivo Molenaar, Donald Rock, Donald Rubin, Lynne Steinberg, Martha Stocking, William Stout, Dorothy Thayer, David Thissen, Wim van der Linden, Howard Wainer, and Marilyn Wingersky. Of course, none of them is responsible for any errors or misstatements in this paper. The research was supported in part by the Cognitive Science Program, Office of Naval Research under Contract No. N00014-87-K-0730 and by the Program Statistics Research Project of Educational Testing Service.
16.
Eric Loken Kelly L. Rulison 《The British journal of mathematical and statistical psychology》2010,63(3):509-525
We explore the justification and formulation of a four‐parameter item response theory model (4PM) and employ a Bayesian approach to recover successfully parameter estimates for items and respondents. For data generated using a 4PM item response model, overall fit is improved when using the 4PM rather than the 3PM or the 2PM. Furthermore, although estimated trait scores under the various models correlate almost perfectly, inferences at the high and low ends of the trait continuum are compromised, with poorer coverage of the confidence intervals when the wrong model is used. We also show in an empirical example that the 4PM can yield new insights into the properties of a widely used delinquency scale. We discuss the implications for building appropriate measurement models in education and psychology to model more accurately the underlying response process.
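The four‐parameter model adds an upper asymptote d < 1 to the familiar 3PL, allowing even high‐ability examinees a nonzero probability of a wrong answer ("slipping"). A minimal sketch of the item characteristic curve, with illustrative parameter values:

```python
import numpy as np

def icc_4pl(theta, a, b, c, d):
    """Four-parameter logistic ICC: lower asymptote c, upper asymptote d.
    Setting d = 1 recovers the 3PL; c = 0 and d = 1 recover the 2PL."""
    return c + (d - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 9)
p4 = icc_4pl(theta, a=1.5, b=0.0, c=0.2, d=0.95)
p3 = icc_4pl(theta, a=1.5, b=0.0, c=0.2, d=1.0)   # 3PL as a special case
```

Because the 4PL curve tops out below 1, a single slip at the high end of the trait is far less damaging to the ability estimate than under the 3PL, which is the mechanism behind the improved interval coverage the abstract reports.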
17.
Jean‐Paul Fox Rinke Klein Entink Marianna Avetisyan 《The British journal of mathematical and statistical psychology》2014,67(1):133-152
Randomized response (RR) models are often used for analysing univariate randomized response data and measuring population prevalence of sensitive behaviours. There is much empirical support for the belief that RR methods improve the cooperation of the respondents. Recently, RR models have been extended to measure individual unidimensional behaviour. An extension of this modelling framework is proposed to measure compensatory or non‐compensatory multiple sensitive factors underlying the randomized item response process. A confirmatory multidimensional randomized item response theory model (MRIRT) is proposed for the analysis of multivariate RR data by modelling the response process and specifying structural relationships between sensitive behaviours and background information. A Markov chain Monte Carlo algorithm is developed to estimate simultaneously the parameters of the MRIRT model. The model extension enables the computation of individual true item response probabilities, estimates of individuals’ sensitive behaviour on different domains, and their relationships with background variables. An MRIRT analysis is presented of data from a college alcohol problem scale, measuring alcohol‐related socio‐emotional and community problems, and alcohol expectancy questionnaire, measuring alcohol‐related sexual enhancement expectancies. Students were interviewed via direct or RR questioning. Scores of alcohol‐related problems and expectancies are positively moderately correlated and vary differently across gender and universities. Scores of alcohol‐related problems and expectancies are significantly higher for the group of students questioned using the RR technique.
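The randomization mechanism underlying all RR models can be illustrated with the classic Warner design (not the paper's MRIRT model, which embeds this mechanism in an item response framework): each respondent answers either the sensitive question or its negation according to a known randomizing probability, so a "yes" is individually deniable but the population prevalence remains estimable. All quantities below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p_keep, prev = 2000, 0.7, 0.30   # p_keep: prob. the sensitive question is asked

# Warner's randomized response design.
sensitive = rng.uniform(size=n) < prev          # true (unobserved) status
ask_direct = rng.uniform(size=n) < p_keep       # randomizing device outcome
answers = np.where(ask_direct, sensitive, ~sensitive)

lam = answers.mean()                            # observed "yes" proportion
# E[lam] = p_keep * prev + (1 - p_keep) * (1 - prev); invert for prevalence:
prev_hat = (lam + p_keep - 1) / (2 * p_keep - 1)
```

The price of protection is variance: prev_hat is unbiased but noisier than a direct estimate, and the inflation grows as p_keep approaches 1/2, where individual answers carry no information at all.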
18.
Testing the fit of finite mixture models is a difficult task, since asymptotic results on the distribution of likelihood ratio statistics do not hold; for this reason, alternative statistics are needed. This paper applies the π* goodness of fit statistic to finite mixture item response models. The π* statistic assumes that the population is composed of two subpopulations – those that follow a parametric model and a residual group outside the model; π* is defined as the proportion of population in the residual group. The population was divided into two or more groups, or classes. Several groups followed an item response model and there was also a residual group. The paper presents maximum likelihood algorithms for estimating item parameters, the probabilities of the groups and π*. The paper also includes a simulation study on goodness of recovery for the two‐ and three‐parameter logistic models and an example with real data from a multiple choice test.
19.
《The British journal of mathematical and statistical psychology》2003,56(1):127-143
Two types of global testing procedures for item fit to the Rasch model were evaluated using simulation studies. The first type incorporates three tests based on first‐order statistics: van den Wollenberg's Q1 test, Glas's R1 test, and Andersen's LR test. The second type incorporates three tests based on second‐order statistics: van den Wollenberg's Q2 test, Glas's R2 test, and a non‐parametric test proposed by Ponocny. The Type I error rates and the power against the violation of parallel item response curves, unidimensionality and local independence were analysed in relation to sample size and test length. In general, the outcomes indicate a satisfactory performance of all tests, except the Q2 test which exhibits an inflated Type I error rate. Further, it was found that both types of tests have power against all three types of model violation. A possible explanation is the interdependencies among the assumptions underlying the model.
20.
Wendy M. Yen 《Psychometrika》1985,50(4):399-410
When the three-parameter logistic model is applied to tests covering a broad range of difficulty, there frequently is an increase in mean item discrimination and a decrease in variance of item difficulties and traits as the tests become more difficult. To examine the hypothesis that this unexpected scale shrinkage effect occurs because the items increase in complexity as they increase in difficulty, an approximate relationship is derived between the unidimensional model used in data analysis and a multidimensional model hypothesized to be generating the item responses. Scale shrinkage is successfully predicted for several sets of simulated data. The author is grateful to Robert Mislevy for kindly providing a copy of his computer program, RESOLVE.