Current approaches to modeling responses and response times on psychometric tests focus solely on between-subject differences in speed and ability. Within subjects, speed and ability are assumed to be constant. Violations of this assumption are generally absorbed in the residual of the model. As a result, within-subject departures from the between-subject speed and ability level remain undetected. These departures may be of interest to the researcher as they reflect differences in the response processes adopted on the items of a test. In this article, we propose a dynamic approach for responses and response times based on hidden Markov modeling to account for within-subject differences in responses and response times. A simulation study is conducted to demonstrate acceptable parameter recovery and acceptable performance of various fit indices in distinguishing between different models. In addition, both a confirmatory and an exploratory application are presented to demonstrate the practical value of the modeling approach.

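The item position effect (IPE) refers to the change in an item's parameters that occurs, after the influence of random error has been removed, when the same item appears at different positions in different tests. The presence of IPE seriously threatens applications that rely on the parameter invariance property of item response theory, such as test equating and computerized adaptive testing. Current research in this area focuses mainly on detecting IPE, whereas further explaining the detected effects is a priority for future research. In addition, in-depth investigation of IPE in different research contexts is important for both basic research and applied practice.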

This study examines separate and concurrent approaches to combine the detection of item parameter drift (IPD) and the estimation of scale transformation coefficients in the context of the common item nonequivalent groups design with the three-parameter item response theory equating. The study uses real and synthetic data sets to compare the two approaches based on IPD flagging rates, type I error and power rates, and recovery of scale transformation coefficients. Results indicate that the two approaches render similar outcomes with stable anchor sets. However, they can produce dissimilar results with unstable anchor sets because of differences in the performance of their IPD components. Further, the findings of this study caution about working backward from equated cut scores to motivate the selection of an anchor set.

Probabilistic reasoning skills are important in various contexts. The aim of the present study was to develop a new instrument (the Probabilistic Reasoning Scale – PRS) to accurately measure low levels of probabilistic reasoning ability in order to identify people with difficulties in this domain. Item response theory was applied to construct the scale, and to investigate differential item functioning (i.e., whether the items were invariant) across genders, educational levels, and languages. Additionally, we tested the validity of the scale by investigating the relationships between the PRS and several other measures. The results revealed that the items had a low level of difficulty. Nonetheless, the discriminative measures showed that the items can discriminate between individuals with different trait levels, and the test information function showed that the scale accurately assesses low levels of probabilistic reasoning ability. Additionally, through investigating differential item functioning, the measurement equivalence of the scale at the item level was confirmed for gender, educational status, and language (i.e., Italian and English). Concerning validity, the results showed the expected correlations with numerical skills, math‐related attitudes, statistics achievement, IQ, reasoning skills, and risky choices both in the Italian and British samples. In conclusion, the PRS is an ideal instrument for identifying individuals who struggle with basic probabilistic reasoning, and who could be targeted by specific interventions. Copyright © 2017 John Wiley & Sons, Ltd.

There is a growing use of noncognitive assessments around the world, and recent research has posited an ideal point response process underlying such measures. A critical issue is whether the typical use of dominance approaches (e.g., average scores, factor analysis, and Samejima's graded response model) in scoring such measures is adequate. This study examined the performance of an ideal point scoring approach (e.g., the generalized graded unfolding model) as compared to the typical dominance scoring approaches in detecting curvilinear relationships between the scored trait and an external variable. Simulation results showed that when data followed the ideal point model, the ideal point approach generally exhibited more power and provided more accurate estimates of curvilinear effects than the dominance approaches. No substantial difference was found between ideal point and dominance scoring approaches in terms of Type I error rate and bias across different sample sizes and scale lengths, although skewness in the distribution of the trait and the external variable can potentially reduce statistical power. For dominance data, the ideal point scoring approach exhibited convergence problems in most conditions and failed to perform as well as the dominance scoring approaches. Practical implications for scoring responses to Likert-type surveys to examine curvilinear effects are discussed.

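Item response theory is a modern measurement theory for measuring examinees' latent traits, and latent class analysis is a model-based technique for classifying latent traits. Mixture item response theory combines item response theory with latent class analysis, making it possible to classify examinees and quantify their latent traits simultaneously. After explaining the concepts and principles of mixture item response theory, this paper introduces several common mixture models, including MRM, mNRM, and mPCM, together with their parameter estimation methods, and reviews the development of their application in psychological testing in terms of classifying psychological and behavioral characteristics, detecting differential item functioning, and evaluating test validity.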

Computerized classification testing (CCT) commonly chooses items maximizing information at the cut score, which yields the most information for decision-making. However, a corollary problem is that all examinees will be given the same set of items, resulting in a high test overlap rate and unbalanced item bank usage, which threatens test security. Moreover, another pivotal issue for CCT is time control. Since both extremely long response times (RTs) and large RT variability across examinees intensify time-induced anxiety, it is crucial to reduce the number of examinees exceeding the time limit and the differences between examinees' test-taking times. To satisfy these practical needs, this paper proposes the novel idea of stage adaptiveness to tailor the item selection process to the decision-making requirement in each step and generate fresh insight into the existing response time selection method. Results indicate that balanced item usage as well as short and stable test times across examinees can be achieved via the new methods.
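
The overlap problem described above can be illustrated with a minimal sketch. It assumes a two-parameter logistic (2PL) model, for which item information is a² · P · (1 − P); the item bank, the function names, and the parameter values are hypothetical and are not taken from the paper.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model (assumed here)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def select_at_cut(bank, cut, n_items):
    """Return the ids of the n_items with the largest information at the cut score."""
    ranked = sorted(
        bank,
        key=lambda item: info_2pl(cut, item["a"], item["b"]),
        reverse=True,
    )
    return [item["id"] for item in ranked[:n_items]]


# Hypothetical four-item bank, for illustration only.
bank = [
    {"id": "i1", "a": 1.2, "b": -1.0},
    {"id": "i2", "a": 1.8, "b": 0.1},
    {"id": "i3", "a": 0.7, "b": 0.0},
    {"id": "i4", "a": 1.5, "b": 1.5},
]

# The selection depends only on the cut score, never on the examinee,
# so every examinee receives the same items.
chosen = select_at_cut(bank, cut=0.0, n_items=2)
print(chosen)
```

Because the ranking uses only the fixed cut score, the same subset is administered to everyone, which is the source of the high overlap rate and uneven bank usage that the abstract describes.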

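When the observed indicator variables are dichotomous categorical data, traditional factor analysis methods are no longer applicable. The authors briefly review the categorical data factor analysis model within the SEM framework and the model relating test items to latent ability within the IRT framework, and summarize the main parameter estimation methods used in the two frameworks. Two simulation studies compare the GLSc and MGLSc estimation methods in the SEM framework with the MML/EM estimation method in the IRT framework. The results show that: (1) among the three methods, GLSc yields the largest bias in parameter estimates, while MGLSc and MML/EM differ little; (2) as sample size increases, the precision of all item parameter estimates improves; (3) the precision of item factor loading and difficulty estimates is affected by test length; (4) the precision of item factor loading and discrimination estimates is affected by the magnitude of the population factor loadings (discrimination); (5) the distribution of thresholds across test items affects the precision of parameter estimation, with item discrimination affected the most; and (6) overall, item parameter estimates in the SEM framework are more precise than those in the IRT framework. In addition, the article offers some suggestions on issues to watch for when applying the two approaches in practice.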

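Psychological and educational measurement generally reaches only the level of an ordinal scale, and the relationship between measurement data and the factors being measured is not simply linear. Item factor analysis is a statistical model used to describe the nonlinear relationship between test items and factors. There are two main classes of item factor analysis methods, those based on structural equation modeling and those based on item response theory. The two are closely related and can even be regarded as two expressions of the same model. This article describes that relationship in detail and analyzes and compares the two classes of methods with respect to parameter estimation, model fit indices, measurement invariance testing, and supporting software, so that researchers can choose the method best suited to their research.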
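
One way to see the link between the two frameworks is the standard parameter conversion for a dichotomous item. The sketch below assumes a unidimensional model with a standard-normal factor, a unit-variance latent response, and the normal-ogive metric, under which a = λ / √(1 − λ²) and b = τ / λ. The function names are hypothetical, and the abstract itself does not specify which parameterization the article uses.

```python
import math


def loading_to_discrimination(lam):
    """Standardized loading -> normal-ogive discrimination: a = lam / sqrt(1 - lam^2)."""
    return lam / math.sqrt(1.0 - lam ** 2)


def threshold_to_difficulty(tau, lam):
    """Threshold and standardized loading -> difficulty: b = tau / lam."""
    return tau / lam


def discrimination_to_loading(a):
    """Inverse conversion: lam = a / sqrt(1 + a^2)."""
    return a / math.sqrt(1.0 + a ** 2)


# Illustrative values (assumed, not from the article).
lam, tau = 0.6, 0.3
a = loading_to_discrimination(lam)
b = threshold_to_difficulty(tau, lam)
print(a, b)
```

For a logistic rather than normal-ogive IRT model, the discrimination is commonly rescaled by a constant of about 1.702; that rescaling is left out of the sketch.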

The Social Phobia Inventory (SPIN) is a widely used measure in mental health settings and a 3-item version (mini-SPIN) has been developed as a screening instrument for social anxiety disorder. In the present study, we examined the psychometric properties of the SPIN and developed a brief version (mini-SPIN-R) designed to assess social anxiety severity using item response theory. Our sample included 569 individuals with social anxiety disorder who participated in 2 clinical trials and filled out a battery of self-report measures. Using a nonparametric kernel smoothing method we identified the most sensitive items of the SPIN. These 3 items comprised the mini-SPIN-R, which was found to have greater internal consistency, and to capture a greater range of symptoms compared to the mini-SPIN. The mini-SPIN-R evidenced superior convergent validity compared to the mini-SPIN and both measures had similar divergent validity. Thus, the mini-SPIN-R is a promising brief measure of social anxiety severity.

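In both confirmatory factor analysis (CFA) and item response theory (IRT), there are test methods for identifying differential item functioning (DIF). Focusing on unidimensional, polytomously scored items, this article introduces the CFA and IRT methods for detecting DIF and compares the two.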

In exploratory item factor analysis (IFA), researchers may use model fit statistics and commonly invoked fit thresholds to help determine the dimensionality of an assessment. However, these indices and thresholds may mislead as they were developed in a confirmatory framework for models with continuous, not categorical, indicators. The present study used Monte Carlo simulation methods to investigate the ability of popular model fit statistics (chi-square, root mean square error of approximation, the comparative fit index, and the Tucker–Lewis index) and their standard cutoff values to detect the optimal number of latent dimensions underlying sets of dichotomous items. Models were fit to data generated from three-factor population structures that varied in factor loading magnitude, factor intercorrelation magnitude, number of indicators, and whether cross loadings or minor factors were included. The effectiveness of the thresholds varied across fit statistics, and was conditional on many features of the underlying model. Together, results suggest that conventional fit thresholds offer questionable utility in the context of IFA.

Multilevel structural equation models are increasingly applied in psychological research. With increasing model complexity, estimation becomes computationally demanding, and small sample sizes pose further challenges for estimation methods relying on asymptotic theory. Recent developments of Bayesian estimation techniques may help to overcome the shortcomings of classical estimation techniques. The use of potentially inaccurate prior information may, however, have detrimental effects, especially in small samples. The present Monte Carlo simulation study compares the statistical performance of classical estimation techniques with Bayesian estimation using different prior specifications for a two-level SEM with either continuous or ordinal indicators. Using two software programs (Mplus and Stan), differential effects of between- and within-level sample sizes on estimation accuracy were investigated. Moreover, it was tested to what extent inaccurate priors may have detrimental effects on parameter estimates in categorical indicator models. For continuous indicators, Bayesian estimation did not show performance advantages over ML. For categorical indicators, Bayesian estimation outperformed WLSMV solely in the case of strongly informative accurate priors. Weakly informative inaccurate priors did not deteriorate performance of the Bayesian approach, while strongly informative inaccurate priors led to severely biased estimates even with large sample sizes. With diffuse priors, Stan yielded better results than Mplus in terms of parameter estimates.

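In psychological and educational measurement, parameter estimation methods for item response theory (IRT) models are basic tools for theoretical research and practical application. Recently, because IRT models keep being extended and the EM (expectation-maximization) algorithm has inherent problems of its own, improving and developing parameter estimation methods has become especially important. This article introduces the development of marginal maximum likelihood estimation in IRT models and proposes that it has passed through stages, namely a joint maximum likelihood estimation stage, a stage of deterministic "imputation" of latent traits, and a stage of stochastic "imputation" of latent traits, with emphasis on the idea of latent trait "imputation" (data augmentation). As different latent trait "imputation" methods, the EM algorithm and the Metropolis-Hastings Robbins-Monro (MH-RM) algorithm both represent conceptual leaps for marginal maximum likelihood estimation. At present, parameter estimation methods based on latent trait "imputation" continue to develop and improve.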

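To address the "sleeping" phenomenon in which high-ability examinees answer easy items incorrectly, the four-parameter logistic model can be used to analyze the data. This study selected real data from psychological tests and achievement tests, fitted them with both traditional models and the four-parameter logistic model, and compared the fit indices and parameter estimates of the different models. The results show that the four-parameter logistic model improves fit, increases the accuracy of the estimates, and effectively corrects the underestimation of high-ability examinees' ability. It is recommended that the four-parameter logistic model be used for data analysis when necessary.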
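
As a rough illustration of why an upper asymptote helps here, the sketch below evaluates the four-parameter logistic response function in its usual form, P(θ) = c + (d − c) / (1 + exp(−a(θ − b))). The function name and all parameter values are hypothetical and are not taken from the study.

```python
import math


def four_pl(theta, a, b, c=0.0, d=1.0):
    """Probability of a correct response under the four-parameter logistic model.

    a: discrimination, b: difficulty, c: lower asymptote (guessing),
    d: upper asymptote (d < 1 allows slips by high-ability examinees).
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


# A high-ability examinee (theta = 3) on an easy item (b = -2); values are illustrative.
p_3pl = four_pl(3.0, a=1.5, b=-2.0, c=0.2, d=1.0)   # upper asymptote fixed at 1
p_4pl = four_pl(3.0, a=1.5, b=-2.0, c=0.2, d=0.95)  # upper asymptote below 1

# With d < 1 the model leaves more probability for an incorrect answer,
# so a single slip on an easy item penalizes the ability estimate less.
print(1.0 - p_3pl, 1.0 - p_4pl)
```

Setting d = 1 recovers the three-parameter model, so the two can be compared as nested models on the same data.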

There are a growing number of item response theory (IRT) studies that calibrate different patient-reported outcome (PRO) measures, such as anxiety, depression, physical function, and pain, on common, instrument-independent metrics. In the case of depression, it has been reported that there are considerable mean score differences when scoring on a common metric from different, previously linked instruments. Ideally, those estimates should be the same. We investigated to what extent those differences are influenced by different scoring methods that take into account several levels of uncertainty, such as measurement error (through plausible value imputation) and item parameter uncertainty (through full Bayesian IRT modeling). Depression estimates from different instruments were more similar, and their corresponding confidence/credible intervals were larger when plausible value imputation or Bayesian modeling was used, compared to the direct use of expected a posteriori (EAP) estimates. Furthermore, we explored the use of Bayesian IRT models to update item parameters based on newly collected data.

Globally, the COVID-19 pandemic has impaired every aspect of life and has caused considerable psychological damage, for instance by increasing the risk of suicide. Intense fear and anxiety are considered to play a central role in mental health problems. This study examined the psychometric properties of the Japanese version of the Fear of COVID-19 Scale (FCV-19S) using classical test theory (CTT) and item response theory (IRT). Five hundred fifty participants aged 18–69 years from across Japan completed questionnaires, including the Japanese FCV-19S, the Japanese Depression Anxiety Stress Scales-15 (DASS-15), and the Japanese version of the Kessler 6 (K6). CTT showed that each item of the Japanese FCV-19S had no ceiling or floor effect and was close to normally distributed, and IRT revealed that each item had appropriate discrimination and difficulty parameters. Finally, the Japanese FCV-19S was shown to have acceptable reliability and moderately good concurrent validity. Consequently, the Japanese FCV-19S has robust psychometric properties and can be useful for early detection of adults impacted by the COVID-19 pandemic.

Item response theory (IRT) and categorical data factor analysis (CDFA) are complementary methods for the analysis of the psychometric properties of psychiatric measures that purport to measure latent constructs. These methods have been applied to relatively few child and adolescent measures. We provide the first combined IRT and CDFA analysis of a clinical measure (the Short Mood and Feelings Questionnaire, SMFQ) in a community sample of 7- through 11-year-old children. Both latent variable models supported the internal construct validity of a single underlying continuum of severity of depressive symptoms. SMFQ items discriminated well at the more severe end of the depressive latent trait. Item performance was not affected by age, although age correlated significantly with latent SMFQ scores, suggesting that symptom severity increased within the age period of 7–11. These results extend existing psychometric studies of the SMFQ and confirm its scaling properties as a potential dimensional measure of symptom severity of childhood depression in community samples.

Pigeons produced a stimulus change either by responding or by not responding for a specified time period (by pausing). They then had to choose between two responses to obtain food. One choice was correct if the first component had been completed by a response; the other was correct if the component had been completed by a pause. The pigeons usually chose correctly, thereby indicating that they used their own prior behavior as a discriminative stimulus. Fixed pause requirements did not produce equal first component completions by a response and by a pause. To obtain equality, the pause requirement was titrated as a function of current performance. Titration resulted in equal completions and also produced accurate discrimination. In addition to showing that pigeons discriminated whether they had responded or paused, the data displayed and discontinuous functions predicted by catastrophe theory. Another procedure used forced choice rather than titration to produce equal completions by pausing and responding and also showed accurate discrimination of behavior.

