Similar documents (20 results)
1.
While most validity indices are based on total test scores, this paper describes a method for quantifying the construct validity of items. The approach is based on the item selection technique originally described by Piazza in 1980. Unfortunately, Piazza's P2 index suffers from some substantial limitations. The Dm coefficient provides an alternative that can be used for item selection and yields a validity index for a set of items. The index is similar to traditional criterion-related validity indices. Criterion-related validity is used to demonstrate the accuracy of hypothesized relations of the measure with outcome variables of interest in research and practice. This method may be useful when the sample of items or persons is small, rendering more traditional approaches such as factor analysis or item response theory inappropriate. An example of how to use the technique is provided.

2.
刘红云  骆方  王玥  张玉 《心理学报》2012,44(1):121-132
The authors briefly review the categorical data factor analysis (CCFA) model within the SEM framework and the model relating test items to latent abilities within the MIRT framework, and summarize the main parameter estimation methods under each. A simulation study compared the WLSc and WLSMV estimators (SEM framework) with the MLR and MCMC estimators (MIRT framework). Results showed: (1) WLSc produced the most biased parameter estimates and suffered from convergence problems; (2) as sample size increased, the precision of all item parameter estimates improved, with very little difference between WLSMV and MLR, which in most conditions were no worse than MCMC; (3) except for WLSc, estimation precision increased with the number of items per dimension; (4) test dimensionality strongly affected the discrimination and difficulty parameters but had relatively little effect on item factor loadings and thresholds; (5) estimation precision depended on the number of dimensions an item measured, with items measuring a single dimension estimated more precisely. The article also offers suggestions on issues to note when applying the two approaches in practice.

3.
When the observed indicator variables are dichotomous, traditional factor analysis is no longer appropriate. The authors briefly review the categorical data factor analysis model under the SEM framework and the item-ability relationship model under the IRT framework, and summarize the main estimation methods used in each. Two simulation studies compared the GLSc and MGLSc estimators (SEM framework) with the MML/EM estimator (IRT framework). Results showed: (1) of the three methods, GLSc yielded the most biased parameter estimates, while MGLSc and MML/EM differed little; (2) the precision of all item parameter estimates improved with sample size; (3) the precision of item factor loading and difficulty estimates was affected by test length; (4) the precision of loading and discrimination estimates was affected by the magnitude of the population loadings (discriminations); (5) the distribution of item thresholds affected estimation precision, with item discrimination affected most; (6) overall, item parameters were estimated more precisely under the SEM framework than under the IRT framework. The article also offers suggestions on issues to note when applying the two approaches in practice.

4.
刘红云  李冲  张平平  骆方 《心理学报》2012,44(8):1124-1136
Measurement equivalence is a prerequisite for multi-group comparisons. The two main approaches to testing it are CFA-based multi-group comparison and IRT-based DIF detection. This article compares the CCFA-based DIFFTEST procedure with the IRT-based likelihood-ratio test (IRT-LR) in the unidimensional case, and DIFFTEST with an MIRT-based chi-square test (MIRT-MG) in the multidimensional case. A simulation study compared the power and Type I error rates of these methods, manipulating total sample size, the balance of group sizes, test length, the size of threshold differences, and the correlation between dimensions. Results showed: (1) In the unidimensional case, IRT-LR is a stricter test than DIFFTEST; in the multidimensional case, MIRT-MG detects differences in item thresholds more readily than DIFFTEST when the test is long and the dimensions are highly correlated, whereas DIFFTEST has slightly higher power when the test is short and the correlation between dimensions is low. (2) The power of DIFFTEST, IRT-LR, and MIRT-MG all increased with the size of the threshold difference; once the difference was medium or large, all three methods detected threshold non-equivalence effectively. (3) Power also increased with total sample size; with total sample size fixed, all three methods were more powerful with balanced than with unbalanced group sizes. (4) With the number of non-equivalent items fixed, the power of DIFFTEST decreased as test length increased, while the power of IRT-LR and MIRT-MG increased. (5) The mean Type I error rate of DIFFTEST was close to the nominal 0.05, whereas the mean rates of IRT-LR and MIRT-MG were far below 0.05.

5.
In assessments of attitudes, personality, and psychopathology, unidimensional scale scores are commonly obtained from Likert scale items to make inferences about individuals' trait levels. This study approached the issue of how best to combine Likert scale items to estimate test scores from the practitioner's perspective: Does it really matter which method is used to estimate a trait? Analyses of 3 data sets indicated that commonly used methods could be classified into 2 groups: methods that explicitly take account of the ordered categorical item distributions (i.e., partial credit and graded response models of item response theory, factor analysis using an asymptotically distribution-free estimator) and methods that do not distinguish Likert-type items from continuously distributed items (i.e., total score, principal component analysis, maximum-likelihood factor analysis). Differences in trait estimates were found to be trivial within each group. Yet the results suggested that inferences about individuals' trait levels differ considerably between the 2 groups. One should therefore choose a method that explicitly takes account of item distributions in estimating unidimensional traits from ordered categorical response formats. Consequences of violating distributional assumptions were discussed.

6.
The application of psychological measures often results in item response data that arguably are consistent with both unidimensional (a single common factor) and multidimensional latent structures (typically caused by parcels of items that tap similar content domains). As such, structural ambiguity leads to seemingly endless "confirmatory" factor analytic studies in which the research question is whether scale scores can be interpreted as reflecting variation on a single trait. An alternative to the more commonly observed unidimensional, correlated traits, or second-order representations of a measure's latent structure is a bifactor model. Bifactor structures, however, are not well understood in the personality assessment community and thus rarely are applied. To address this, herein we (a) describe issues that arise in conceptualizing and modeling multidimensionality, (b) describe exploratory (including Schmid-Leiman [Schmid & Leiman, 1957] and target bifactor rotations) and confirmatory bifactor modeling, (c) differentiate between bifactor and second-order models, and (d) suggest contexts where bifactor analysis is particularly valuable (e.g., for evaluating the plausibility of subscales, determining the extent to which scores reflect a single variable even when the data are multidimensional, and evaluating the feasibility of applying a unidimensional item response theory (IRT) measurement model). We emphasize that the determination of dimensionality is a related but distinct question from either determining the extent to which scores reflect a single individual difference variable or determining the effect of multidimensionality on IRT item parameter estimates. Indeed, we suggest that in many contexts, multidimensional data can yield interpretable scale scores and be appropriately fitted to unidimensional IRT models.
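The Schmid-Leiman transformation mentioned above converts a second-order solution into a bifactor-style loading pattern. A minimal numeric sketch, assuming a simple-structure first-order solution; all loading values are invented for illustration:

```python
import numpy as np

# Illustrative first-order loadings: 9 items on 3 factors (simple structure),
# plus second-order loadings of those factors on a general factor.
lam1 = np.zeros((9, 3))
lam1[0:3, 0] = [0.7, 0.6, 0.8]
lam1[3:6, 1] = [0.5, 0.7, 0.6]
lam1[6:9, 2] = [0.8, 0.7, 0.6]
gamma = np.array([0.8, 0.7, 0.6])          # second-order loadings

# Schmid-Leiman: general loading = first-order loading * gamma of the item's
# factor; residualized group loading = first-order loading * sqrt(1 - gamma^2).
general = (lam1 * gamma).sum(axis=1)
group = lam1 * np.sqrt(1 - gamma ** 2)

# The transformation preserves each item's common variance:
# general^2 + sum of group^2 equals the original first-order communality.
assert np.allclose(general**2 + (group**2).sum(axis=1), (lam1**2).sum(axis=1))
```

The check at the end is the defining property of the transformation: it redistributes, rather than changes, each item's communality between the general and group factors.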

7.
In every cross-cultural study, the question of whether test scores obtained in different cultural populations can be interpreted in the same way across these populations has to be addressed. Bias and equivalence have become the common terms for this issue. A taxonomy of both bias and equivalence is presented. Bias can be engendered by the theoretical construct (construct bias), the method such as the form of test administration (method bias), and the item content (item bias). Equivalence refers to the measurement level at which scores can be compared across cultures. Three levels of equivalence are possible: the same construct is measured in each cultural group but the functional form of the relationship between scores obtained in various groups is unknown (structural equivalence), scores have the same measurement unit across populations but have different origins (measurement unit equivalence), and scores have the same measurement unit and origin in all populations (full scale equivalence). The most frequently encountered sources of bias and their remedies are described.

8.
Several authors have suggested that prior to conducting a confirmatory factor analysis it may be useful to group items into a smaller number of item 'parcels' or 'testlets'. The present paper mathematically shows that coefficient alpha based on these parcel scores will only exceed alpha based on the entire set of items if W, the ratio of the average covariance of items between parcels to the average covariance of items within parcels, is greater than unity. If W is less than unity, however, and errors of measurement are uncorrelated, then stratified alpha will be a better lower bound to the reliability of a measure than the other two coefficients. Stratified alpha is also equal to the true reliability of a test when items within parcels are essentially tau-equivalent, if one assumes that errors of measurement are not correlated.
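The inequality described above can be checked numerically. The sketch below builds a synthetic item covariance matrix with W < 1 (between-parcel covariances smaller than within-parcel ones) and compares item-level alpha, parcel-level alpha, and stratified alpha; all covariance values are invented:

```python
import numpy as np

def alpha(cov):
    """Cronbach's alpha from the covariance matrix of k parts."""
    k = cov.shape[0]
    return k / (k - 1) * (1 - np.trace(cov) / cov.sum())

# 6 items in 2 parcels of 3: within-parcel covariance 0.5,
# between-parcel covariance 0.3 (so W < 1), unit variances.
within, between = 0.5, 0.3
cov = np.full((6, 6), between)
for p in (slice(0, 3), slice(3, 6)):
    cov[p, p] = within
np.fill_diagonal(cov, 1.0)

W = between / within                       # ratio from the abstract: here 0.6
alpha_items = alpha(cov)                   # alpha on all 6 items

# Covariance matrix of the 2 parcel sum scores (block sums of the item matrix).
parcel_cov = np.array([[cov[0:3, 0:3].sum(), cov[0:3, 3:6].sum()],
                       [cov[3:6, 0:3].sum(), cov[3:6, 3:6].sum()]])
alpha_parcels = alpha(parcel_cov)

# Stratified alpha: 1 - sum_j var_j * (1 - alpha_j) / total variance.
strat = 1 - sum(parcel_cov[j, j] * (1 - alpha(cov[3*j:3*j+3, 3*j:3*j+3]))
                for j in range(2)) / cov.sum()
```

With W < 1 this yields alpha_parcels < alpha_items < stratified alpha, in line with the paper's claim that stratified alpha is the better lower bound in that case.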

9.
In this study, we compared classical test theory (CTT) and item response theory (IRT) approaches in analyzing the Center for Epidemiological Studies Depression (CES-D) Scale (Radloff, 1977). Standard item analyses, as well as Rasch (1960) analyses, both revealed item departures from unidimensionality in a sample of 2,455 older persons responding to the CES-D. Positive affect items in the scale performed poorly overall, their removal reducing the scale's bandwidth only slightly. Modeling depression scores derived from Rasch measures and raw totals showed subtle but important differences for statistical inference. The assessment of depressive risk was slightly enhanced by using 16-item scale measures obtained from the results of the Rasch analysis as the dependent variable. Confirmatory factor analysis and parallel analysis verified the advantages of removing positively worded items. IRT and CTT techniques proved to be complementary in this study and can be usefully combined to improve the measurement of depression.

10.
毛秀珍  刘欢  唐倩 《心理科学》2019,(1):187-193
The bifactor model assumes that a test measures one general factor and several group factors, a structure consistent with many educational and psychological tests. "Dimension reduction", which simplifies the multidimensional integrals in parameter estimation into a set of iterated two-dimensional integrals, is a key feature of the bifactor model. For computerized adaptive testing with polytomously scored items, this article first derives the Fisher information under the bifactor graded response model, then derives the application of dimension reduction to item selection, and finally compares the D-optimality method, posterior-weighted Fisher-information D-optimality (PDO), posterior-weighted Kullback-Leibler (PKL), continuous entropy (CEM), and mutual information (MI) methods in item banks with weak, moderate, and strong bifactor patterns, in terms of the correlation, root mean square error, absolute bias, and Euclidean distance of the ability estimates. Simulation results showed: (1) the stronger the bifactor pattern, i.e., the smaller the difference between the discriminations of the general factor and the group factors on an item, the lower the estimation precision for the general factor, the higher the precision for the group factors, and the higher the precision for overall ability; (2) under the same conditions, continuous entropy yielded the highest measurement precision and PKL the lowest, with no significant differences among the remaining methods.
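The D-optimality selection rule compared above maximizes the determinant of the accumulated Fisher information. A minimal sketch, using a two-dimensional 2PL item (rather than the bifactor graded response model of the article) to keep the information matrix short; all discrimination and difficulty values are invented:

```python
import numpy as np

def m2pl_info(a, theta, b):
    """Fisher information matrix of a multidimensional 2PL item at theta:
    I(theta) = P(1 - P) a a^T, with P the item response probability."""
    p = 1 / (1 + np.exp(-(a @ theta - b)))
    return p * (1 - p) * np.outer(a, a)

theta = np.array([0.0, 0.0])               # current ability estimate
acc = np.eye(2) * 0.1                      # accumulated (prior) information

# Candidate items as (discrimination vector, difficulty) pairs.
items = [(np.array([1.2, 0.3]), 0.0),
         (np.array([0.4, 1.5]), 0.5),
         (np.array([0.8, 0.8]), -0.3)]

# D-optimality: administer the item giving the largest determinant of the
# updated information matrix.
best = max(range(len(items)),
           key=lambda j: np.linalg.det(acc + m2pl_info(items[j][0],
                                                       theta, items[j][1])))
```

Here the second candidate wins because its information adds most to the weakly measured second dimension, which is exactly the behavior D-optimality rewards.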

11.
The paper addresses three neglected questions from IRT. In section 1, the properties of the "measurement" of ability or trait parameters and item difficulty parameters in the Rasch model are discussed. It is shown that the solution to this problem is rather complex and depends both on general assumptions about properties of the item response functions and on assumptions about the available item universe. Section 2 deals with the measurement of individual change or "modifiability" based on a Rasch test. A conditional likelihood approach is presented that (a) yields an ML estimator of modifiability for given item parameters, (b) allows hypotheses about change to be tested by means of a Clopper-Pearson confidence interval for the modifiability parameter, and (c) allows modifiability to be estimated jointly with the item parameters. Uniqueness results for all three methods are also presented. In section 3, the Mantel-Haenszel method for detecting DIF is discussed under a novel perspective: What is the most general framework within which the Mantel-Haenszel method correctly detects DIF of a studied item? The answer is that this is a 2PL model where, however, all discrimination parameters are known and the studied item has the same discrimination in both populations. Since these requirements would hardly be satisfied in practical applications, the case of constant discrimination parameters, that is, the Rasch model, is the only realistic framework. A simple Pearson χ2 test for DIF of one studied item is proposed as an alternative to the Mantel-Haenszel test; moreover, this test is generalized to the case of two items simultaneously studied for DIF.
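The Mantel-Haenszel common odds-ratio estimate that underlies the DIF method discussed in section 3 can be sketched as follows. The stratum counts are invented, with identical answer proportions in both groups, so the studied item exhibits no DIF and the estimate is exactly 1:

```python
import numpy as np

# One 2x2 table per total-score stratum for the studied item:
# rows = reference/focal group, columns = correct/incorrect.
tables = [
    np.array([[30, 20], [30, 20]]),   # low-score stratum
    np.array([[50, 15], [50, 15]]),   # middle stratum
    np.array([[60,  5], [60,  5]]),   # high-score stratum
]

# Mantel-Haenszel common odds ratio:
# sum_k(A_k * D_k / N_k) / sum_k(B_k * C_k / N_k).
num = sum(t[0, 0] * t[1, 1] / t.sum() for t in tables)
den = sum(t[0, 1] * t[1, 0] / t.sum() for t in tables)
or_mh = num / den   # values near 1 indicate no DIF on the studied item
```

An odds ratio well above or below 1, accumulated consistently across strata, is what the MH procedure (and the Pearson χ2 alternative proposed in the paper) treats as evidence of DIF.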

12.
The Adolescent Quality of Life-Mental Health Scale (AQOL-MHS) was designed to measure quality of life in clinical samples of Latino adolescents aged 12–18 years, but has also been used in community samples. The original measure included three factors: Emotional Regulation (ER), Self-Concept (SC) and Social Context (SoC). The goals of this study are to replicate the factor structure using confirmatory factor analysis (CFA), shorten the instrument and test the degree of measurement invariance across gender, age, and type of sample. Participants for the analyses (N = 354) came from two populations in the San Juan Metropolitan Area: (1) adolescents from randomly selected households, using a multi-stage probability sampling design (n = 295), and (2) adolescents receiving treatment at mental health clinics (n = 59). We first carried out a conceptual item analysis for item reduction purposes and then assessed dimensional, configural, metric and scalar invariance for each factor using the Mplus software system. The original 3-factor structure was replicated with comparable model fit in each treatment context. Metric invariance was attained for all three scales across groups. Either full or partial scalar invariance was also observed with DIF in a total of 6 items. Invariance testing supports the use of the abridged 21-item version of the AQOL-MHS to compare diverse individuals with little bias using observed scores, but for refined estimates the ideal scoring will be from a latent variable model.

13.
For dual-objective CD-CAT, six item discrimination indices (discrimination D, the general discrimination index GDI, the odds ratio OR, the 2PL discrimination a, the attribute discrimination index ADI, and the cognitive diagnostic index CDI) were each combined with the IPA method to form new item selection strategies. A simulation study compared their performance and also examined discrimination-based stratification for item exposure control. Results showed that the new methods all markedly improved the classification accuracy of knowledge states and the precision of ability estimates, and that stratified selection substantially improved item bank usage. Overall, OR weighting significantly improved measurement precision, and OR-stratified selection significantly improved the uniformity of item exposure while maintaining measurement precision.

14.
This article (a) describes how McDonald's nonlinear factor analytic approach to the normal ogive curve can be used to factor analyse total test scores, (b) discusses the conditions in which this model is more appropriate than the widely used linear model, and (c) illustrates the applicability of both models using an empirical example. The rationale for the described procedure is that the test scores are simple sums of binary item responses whose item characteristic curves are adequately represented by normal ogives. The results obtained in the empirical example are meaningful and informative, and agree with the results obtained at the item level.

15.
Visual perceptual skills of school-age children are often assessed using the Supplemental Developmental Test of Visual Perception of the Developmental Test of Visual-Motor Integration. The study purpose was to consider the construct validity of this test by evaluating its scalability (interval level measurement), unidimensionality, differential item functioning, and hierarchical ordering of its items. Visual perceptual performance scores from a sample of 356 typically developing children (171 boys and 185 girls ages 5 to 11 years) were used to complete a Rasch analysis of the test. Seven items were discarded for poor fit, while none of the items exhibited differential item functioning by sex. The construct validity, scalability, hierarchical ordering, and lack of differential item functioning requirements were met by the final test version. Since 7 test items did not fit the Rasch analysis specifications, the clinical value of the test is questionable and limited.

16.
Taking a cognitive ability test for children aged 4–5 as an example, this study explored how to conduct measurement invariance analysis of longitudinal data within the IRT framework. The analyses used a between-item multidimensional item response theory (MIRT) model and a within-item multidimensional two-tier model. Participants were 882 48-month-old children from across the country; the instrument was a self-developed cognitive ability test for children aged 4–5. Test-level and item-level analyses showed that: (1) the proposed approach to measurement invariance analysis of longitudinal data is reasonable and effective; (2) the test satisfied partial measurement invariance across the two time points, and its latent structure was stable; (3) both the discrimination and difficulty parameters of the "orientation" items changed, and the difficulty parameters of four additional items fluctuated; (4) children's cognitive ability showed an overall trend of rapid development between ages 4 and 5, with significant growth in ability.

17.
罗芬  王晓庆  蔡艳  涂冬波 《心理学报》2020,52(12):1452-1465
The results of a dual-objective CD-CAT can serve both formative and summative assessment. The Gini index measures the uncertainty of a random variable, with smaller values indicating less uncertainty. This article uses the Gini index to track changes in the posterior probabilities of examinees' knowledge-state classes and of the confidence interval for the ability estimate, and proposes item selection strategies based on the Gini index. Monte Carlo experiments show that, compared with existing selection strategies, the new strategy achieves higher knowledge-state classification accuracy and ability estimation precision while effectively balancing item bank usage, responds quickly in real time, and is little affected by the cognitive diagnosis model or the distribution of examinees' knowledge states, making it suitable for operational item banks containing a mix of cognitive diagnosis models.
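The Gini index of a discrete posterior, as used above, is simple to compute. A minimal sketch with a hypothetical posterior over four knowledge states; the selection rule described in the abstract would favor items expected to shrink this index most:

```python
import numpy as np

def gini(p):
    """Gini index of a discrete distribution: 1 - sum(p_i^2).
    0 for a point mass (no uncertainty); largest for a uniform distribution."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# Hypothetical posterior over 4 knowledge states before and after one item.
before = [0.25, 0.25, 0.25, 0.25]
after = [0.70, 0.20, 0.05, 0.05]

drop = gini(before) - gini(after)   # positive drop = reduced uncertainty
```

A Gini-index-based strategy in this spirit compares candidate items by the expected drop and administers the one with the largest expected reduction in uncertainty.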

18.
Several approaches exist to model interactions between latent variables. However, it is unclear how these perform when item scores are skewed and ordinal. Research on Type D personality serves as a good case study for that matter. In Study 1, we fitted a multivariate interaction model to predict depression and anxiety with Type D personality, operationalized as an interaction between its two subcomponents negative affectivity (NA) and social inhibition (SI). We constructed this interaction according to four approaches: (1) sum score product; (2) single product indicator; (3) matched product indicators; and (4) latent moderated structural equations (LMS). In Study 2, we compared these interaction models in a simulation study by assessing for each method the bias and precision of the estimated interaction effect under varying conditions. In Study 1, all methods showed a significant Type D effect on both depression and anxiety, although this effect diminished after including the NA and SI quadratic effects. Study 2 showed that the LMS approach performed best with respect to minimizing bias and maximizing power, even when item scores were ordinal and skewed. However, when latent traits were skewed LMS resulted in more false-positive conclusions, while the Matched PI approach adequately controlled the false-positive rate.

19.
叶宝娟  温忠麟 《心理学报》2012,44(12):1687-1694
When deciding whether to combine scores on a multidimensional test into a total score, test homogeneity should be considered: if homogeneity is too low, the composite total score is of little value. Homogeneity can be quantified by a homogeneity coefficient. The model used to compute it is the bifactor model (with both a general factor and group factors), which has attracted attention in recent years; the homogeneity coefficient of a test is defined as the proportion of the variance of test scores accounted for by the general factor. This article derives a standard error formula for the homogeneity coefficient using the Delta method and then computes its confidence interval, and provides a simple program for computing the coefficient and its confidence interval. An example illustrates how to estimate the homogeneity coefficient and its confidence interval; a simulation comparing Delta-method and bootstrap confidence intervals found very little difference between the two.
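The point estimate of the homogeneity coefficient defined above can be sketched directly from bifactor loadings (the Delta-method standard error derived in the article is omitted here). All loading values are invented for illustration:

```python
import numpy as np

# Bifactor loadings for 6 standardized items: one general factor, two group
# factors with simple structure.
lam_g = np.array([0.7, 0.6, 0.65, 0.5, 0.55, 0.6])   # general-factor loadings
lam_s = np.zeros((6, 2))
lam_s[0:3, 0] = [0.4, 0.35, 0.3]                     # group factor 1
lam_s[3:6, 1] = [0.45, 0.4, 0.35]                    # group factor 2
psi = 1 - lam_g**2 - lam_s.sum(axis=1)**2            # item uniquenesses

# Total-score variance decomposes into general, group, and unique parts;
# the homogeneity coefficient is the general factor's share.
total_var = lam_g.sum()**2 + (lam_s.sum(axis=0)**2).sum() + psi.sum()
h = lam_g.sum()**2 / total_var
```

Values of h near 1 indicate that total-score variance is dominated by the general factor, supporting the use of a composite total score; low values argue against combining the dimensions.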

20.
According to Wollack and Schoenig (2018, The Sage encyclopedia of educational research, measurement, and evaluation. Thousand Oaks, CA: Sage, 260), benefiting from item preknowledge is one of the three broad types of test fraud that occur in educational assessments. We use tools from constrained statistical inference to suggest a new statistic that is based on item scores and response times and can be used to detect examinees who may have benefited from item preknowledge for the case when the set of compromised items is known. The asymptotic distribution of the new statistic under no preknowledge is proved to be a simple mixture of two χ2 distributions. We perform a detailed simulation study to show that the Type I error rate of the new statistic is very close to the nominal level and that the power of the new statistic is satisfactory in comparison to that of the existing statistics for detecting item preknowledge based on both item scores and response times. We also include a real data example to demonstrate the usefulness of the suggested statistic.
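Chi-square mixture null distributions of the kind proved above are typical of one-sided constrained tests. A minimal sketch of computing a p-value under the common 50:50 mixture of χ²₀ and χ²₁; the particular weights and degrees of freedom here are illustrative, not taken from the article:

```python
from scipy.stats import chi2

def mixture_pvalue(t):
    """P-value of statistic t under the mixture 0.5*chi2_0 + 0.5*chi2_1.
    The chi2_0 component is a point mass at zero, so only the chi2_1
    tail contributes for t > 0."""
    if t <= 0:
        return 1.0
    return 0.5 * chi2.sf(t, df=1)

# The alpha = 0.05 critical value is the chi2_1 90th percentile (about 2.706),
# smaller than the usual 3.841 because of the point mass at zero.
crit = chi2.ppf(0.9, df=1)
```

Using the mixture rather than a plain χ²₁ reference makes the one-sided constrained test less conservative, which is part of why such statistics can achieve accurate Type I error rates.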

