Similar literature
1.
The purpose of this study is to explore patterns in model-data fit related to subgroups of test takers from a large-scale writing assessment. Using data from the SAT, a calibration group was randomly selected to represent test takers who reported that English was their best language (EBL) from the total population of test takers (N = 322,011). A reference scale for the items was constructed based on EBL responses. Response behaviors of test takers who reported that English was not their best language (ENBL) were examined in relationship to this reference scale. This study illustrates the use of differential subgroup analyses to identify patterns related to person misfit within subgroups, as well as subsets of items, that may affect the validity of writing scores for ENBL test takers. The methodology described here offers an approach that can be used to explore, understand, and improve the validity of scores obtained from ENBL test takers in large-scale writing assessments.

2.
Researchers apply individual person fit analyses as a procedure for checking model-data fit for individual test-takers. When a test-taker misfits, it means that the inferences from their test score regarding what they know and can do may not be accurate. One problem in applying individual person fit procedures in practice is the question of how much misfit it takes to make the test score an untrustworthy estimate of achievement. In this paper, we argue that if a person’s responses generally follow a monotonic pattern, the resulting test score is “good enough” to be interpreted and used. We present an approach that applies statistical procedures from the Rasch and Mokken measurement perspectives to examine individual person fit based on this good enough criterion in real data from a performance assessment. We discuss how these perspectives may facilitate thinking about applying individual person fit procedures in practice.

3.
The Team Role Self Perception Inventory (TRSPI) has attracted several studies critical of its psychometric properties. This research uses a large data set and employs confirmatory factor analysis on within‐scale scores to examine the dimensionality and reliability of the TRSPI's scales. Data show that five of the nine scales are unidimensional and that two other scales show generally good fit to a unidimensional solution. The ‘completer‐finisher’ and ‘implementer’ scales show a better fit to a bidimensional structure and would benefit from improved item wording for a small number of items. The ‘shaper’ scale would also benefit from some attention to item wording. Reliability estimates suggest that the reliability of the TRSPI's scales is better than previous estimates imply.

4.
This article describes a model for response times that is proposed as a supplement to the usual factor-analytic model for responses to graded or more continuous typical-response items. The use of the proposed model together with the factor model provides additional information about the respondent and can potentially increase the accuracy of the individual trait estimates. First, the rationale of the model is discussed in relation to previous developments in binary responses. Second, procedures for fitting the model and for assessing model-data fit at both overall and individual level (person-fit) are proposed. Third, the usefulness of the model and its potential applications in the typical-response domain are discussed. All the proposed developments are used in 2 empirical applications in the personality domain. The first application analyzes 2 scales from a Big Five questionnaire. The second example analyzes a sociability scale developed from Eysenck's questionnaires.

5.
In personality and attitude measurement, the presence of acquiescent responding can have an impact on the whole process of item calibration and test scoring, and this can occur even when sensible procedures for controlling acquiescence are used. This paper considers a bidimensional (content acquiescence) factor‐analytic model to be the correct model, and assesses the effects of fitting unidimensional models to theoretically unidimensional scales when acquiescence is in fact operating. The analysis considers two types of scales: non‐balanced and fully balanced. The effects are analysed at both the calibration and the scoring stages, and are of two types: bias in the item/respondent parameter estimates and model/person misfit. The results obtained theoretically are checked and assessed by means of simulation. The results and predictions are then assessed in an empirical study based on two personality scales. The implications of the results for applied personality research are discussed.

6.
The main aim of this article is to explicate why a transition to ideal point methods of scale construction is needed to advance the field of personality assessment. The study empirically demonstrated the substantive benefits of ideal point methodology as compared with the dominance framework underlying traditional methods of scale construction. Specifically, using a large, heterogeneous pool of order items, the authors constructed scales using traditional classical test theory, dominance item response theory (IRT), and ideal point IRT methods. The merits of each method were examined in terms of item pool utilization, model-data fit, measurement precision, and construct and criterion-related validity. Results show that adoption of the ideal point approach provided a more flexible platform for creating future personality measures, and this transition did not adversely affect the validity of personality test scores.

7.
Commonly used techniques for analyzing the structure of the MMPI scales were discussed and the use of a latent trait model was suggested as an alternative. The items on each scale of the MMPI were calibrated using a discrimination statistic. The item calibration statistics obtained from a replication sample were highly correlated with those obtained in the first sample. Poor fitting items were identified, and possible reasons for poor fits were discussed. The scales generally had few poor fits. The poor fitting items were generally those identified by Wiener (1956) as comprising the "subtle" subscales of the test.

8.

Objective

The Coping Scale for Chinese Athletes (CSCA) was developed and validated using classical test theory in 2004 (Chung, Si, Lee, & Liu, 2004). This study aimed to validate the CSCA using multidimensional Rasch analysis with the ConQuest software programme.

Method

The sample in this study comprised 367 athletes from mainland China. A Multidimensional Rating Scale model was applied to investigate the validity of the four-dimension scale. Standard fit statistics (Infit and Outfit MNSQ) and Differential item functioning (DIF) were computed to examine the model-data fit. Test reliability and category functioning were also checked.

Results

The item difficulty and the athletes’ trait level of coping were calibrated along the same latent trait scale. Three items were removed from the scale due to misfit with the Rasch model. No DIF across gender was found for the remaining 21 items. Test reliabilities for the four subscales ranged from 0.66 to 0.76. The results also indicated that the original 5-category rating scale structure did not function well.

Conclusion

The multidimensional Rasch analysis supported that the 21-item CSCA measures four latent traits of coping of Chinese athletes as expected. The results also demonstrated advantages of multidimensional Rasch analysis over unidimensional Rasch analysis, as well as over the traditional approach, in examining the quality of a multidimensional scale in sport settings.
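The Infit and Outfit mean squares mentioned in the Method above can be illustrated with a minimal sketch. It assumes the simpler dichotomous Rasch model rather than the multidimensional rating scale model fitted in the study, and it treats person measures and item difficulty as already estimated; the function names `rasch_prob` and `item_fit` are hypothetical and are not taken from ConQuest.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_fit(responses, thetas, b):
    """Return (infit, outfit) mean squares for a single item.

    responses: 0/1 scores of each person on the item
    thetas:    person measures (assumed already estimated)
    b:         item difficulty (assumed already estimated)
    """
    sq_resid = []  # squared raw residuals (x - P)^2
    info = []      # model variances P(1 - P)
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, b)
        sq_resid.append((x - p) ** 2)
        info.append(p * (1.0 - p))
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / w for r, w in zip(sq_resid, info)) / len(responses)
    # Infit: information-weighted mean square.
    infit = sum(sq_resid) / sum(info)
    return infit, outfit
```

Values near 1 indicate responses about as noisy as the model expects; a single highly unexpected response (an able person failing an easy item) inflates Outfit much more than Infit.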

9.
In practice, the sum of the item scores is often used as a basis for comparing subjects. For items that have more than two ordered score categories, only the partial credit model (PCM) and special cases of this model imply that the subjects are stochastically ordered on the common latent variable. However, the PCM is very restrictive with respect to the constraints that it imposes on the data. In this paper, sufficient conditions for the stochastic ordering of subjects by their sum score are obtained. These conditions define the isotonic (nonparametric) PCM model. The isotonic PCM is more flexible than the PCM, which makes it useful for a wider variety of tests. Also, observable properties of the isotonic PCM are derived in the form of inequality constraints. It is shown how to obtain estimates of the score distribution under these constraints by using the Gibbs sampling algorithm. A small simulation study shows that the Bayesian p-values based on the log-likelihood ratio statistic can be used to assess the fit of the isotonic PCM to the data, where model-data fit can be taken as a justification of the use of the sum score to order subjects.

10.
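This study aimed to develop an emotional contagion questionnaire and to examine its reliability and validity. The questionnaire was modeled on Doherty's Emotional Contagion Scale: items unsuited to Eastern culture were deleted, and items reflecting the emotional situations and modes of emotional expression of Chinese people were added. After pilot testing and revision, the final questionnaire comprised 25 items across 5 dimensions. The formal administration yielded 747 valid questionnaires, and the data were tested for reliability and validity. Results showed that the KMO value in the exploratory factor analysis was 0.802, the standardized Cronbach's α coefficient for the full questionnaire was 0.852, item analysis gave discrimination indices (D) between 0.340 and 0.479, confirmatory factor analysis indicated good model fit, and both test-retest reliability and criterion-related validity reached significance. The reliability and validity of the emotional contagion questionnaire met psychometric requirements.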
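As an illustration of the internal-consistency coefficient reported above, the following is a minimal sketch of raw Cronbach's α computed from item variances and total-score variance. Note that the abstract reports the standardized coefficient, which is based on inter-item correlations instead; the function name `cronbach_alpha` and the toy data are assumptions made for the example.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Raw Cronbach's alpha.

    scores: list of respondents, each a list of item scores
            (same number of items for every respondent).
    """
    k = len(scores[0])                      # number of items
    items = list(zip(*scores))              # transpose to per-item columns
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Toy data: three respondents, three perfectly consistent items.
toy = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
alpha = cronbach_alpha(toy)
```

With perfectly consistent items, as in the toy data, the coefficient reaches its upper bound of 1; real questionnaires fall below that.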

11.
12.
This paper examines psychometric properties of scores derived from calibration curves (overconfidence, calibration, resolution, and slope) and an analogue of overconfidence that is based on a posttest estimate of the proportion of correctly solved items. Four tests from the theory of fluid and crystallized intelligence were used, and two of these tests employed both sequential and simultaneous methods of item presentation. The results indicate that the overconfidence score not only has the highest reliability, but is the only score with a reliability normally considered adequate for use in individual differences research. There is some, albeit weak, difference in subjects' level of overconfidence between sequential and simultaneous methods of item presentation. Correlational evidence confirms our previous findings that overconfidence scores from perceptual and ‘knowledge’ tasks define the same factor. In agreement with the results of Gigerenzer, Hoffrage and Kleinbölting (1991), subjects' post-test estimates of their performance showed lower levels of overconfidence than did the traditional measures based on subjects' confidence judgment responses to individual items. Also, after controlling for the actual test performances, the post-test performance estimates and average confidence ratings were only slightly positively correlated, suggesting that different psychological processes may underlie these two measures. Finally, our results suggest that average confidence over all items in the test may be a more useful measure in individual differences research than scores derived from calibration curves.
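The two overconfidence measures contrasted above can be sketched as follows, assuming the common definition of overconfidence as mean confidence minus proportion correct, and its post-test analogue as estimated minus actual proportion correct. The function names and the example numbers are hypothetical, not taken from the paper.

```python
def overconfidence(confidences, correct):
    """Mean item-level confidence (0..1) minus proportion of correct answers."""
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_conf - accuracy


def posttest_overconfidence(estimated_n_correct, correct):
    """Post-test analogue: estimated minus actual proportion correct."""
    n = len(correct)
    return estimated_n_correct / n - sum(correct) / n


# Toy subject: four items, two answered correctly.
item_conf = [0.9, 0.8, 0.7, 0.6]
item_correct = [1, 0, 1, 0]
bias_items = overconfidence(item_conf, item_correct)
bias_post = posttest_overconfidence(2, item_correct)
```

Positive values indicate overconfidence and negative values underconfidence; in the toy data the item-level score is positive while the post-test score is zero, the direction of difference the abstract describes.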

13.
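Estimation of item parameters for new (uncalibrated) items in computerized adaptive testing   Cited by: 1 (self-citations: 1, citations by others: 0)
The security of computerized adaptive testing (CAT) faces new challenges, and small item banks are especially vulnerable. How to build a large, high-quality item bank has therefore become a very important topic in CAT research. Current approaches to building CAT item banks have several problems, such as high cost and poor confidentiality; in particular, equating techniques are relatively complex, and repeated use of anchor items easily leads to item exposure. If items that have not yet been calibrated (new items) could be inserted during CAT administration and their parameters estimated at the same time, the value for building large, high-quality CAT item banks would be self-evident. This paper studies the problem under the 1PLM and the 2PLM, proposes a new method for online estimation of new items, and derives a formula for the initial iteration value of the discrimination parameter a. Results show that, in both the simulation study and the empirical study, the number of times a new item is answered affects the item parameter estimates, and that the more examinees answer a new item, the more precise its parameter estimates become.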
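To make the idea of calibrating a new item during testing concrete, here is a minimal sketch of a generic maximum-likelihood estimate of the 1PL difficulty parameter b, treating the abilities of the examinees who answered the item as known. This is a textbook Newton-Raphson scheme and not the new method proposed in the paper; the function name `estimate_difficulty` and its parameters are hypothetical.

```python
import math


def estimate_difficulty(thetas, responses, b_start=0.0, tol=1e-8, max_iter=50):
    """Newton-Raphson MLE of 1PL item difficulty with abilities held fixed.

    thetas:    ability estimates of examinees who saw the new item (assumed known)
    responses: their 0/1 answers to the new item (must not be all 0 or all 1)
    """
    b = b_start
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(t - b))) for t in thetas]
        # First derivative of the log-likelihood with respect to b.
        grad = sum(p - u for p, u in zip(probs, responses))
        # Negative second derivative (item information summed over examinees).
        info = sum(p * (1.0 - p) for p in probs)
        step = grad / info
        b += step
        if abs(step) < tol:
            break
    return b
```

The summed information in the denominator grows with the number of examinees, which is consistent with the abstract's finding that more responses to a new item give more precise parameter estimates.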

14.
In assessments of attitudes, personality, and psychopathology, unidimensional scale scores are commonly obtained from Likert scale items to make inferences about individuals' trait levels. This study approached the issue of how best to combine Likert scale items to estimate test scores from the practitioner's perspective: Does it really matter which method is used to estimate a trait? Analyses of 3 data sets indicated that commonly used methods could be classified into 2 groups: methods that explicitly take account of the ordered categorical item distributions (i.e., partial credit and graded response models of item response theory, factor analysis using an asymptotically distribution-free estimator) and methods that do not distinguish Likert-type items from continuously distributed items (i.e., total score, principal component analysis, maximum-likelihood factor analysis). Differences in trait estimates were found to be trivial within each group. Yet the results suggested that inferences about individuals' trait levels differ considerably between the 2 groups. One should therefore choose a method that explicitly takes account of item distributions in estimating unidimensional traits from ordered categorical response formats. Consequences of violating distributional assumptions were discussed.

15.
Classical item analysis procedures were developed for dichotomously scored items and do not apply to items allowing multiple correct responses. Maximum likelihood procedures analogous to those employed in polychotomous bio-assay are presented which yield estimates of the sets of parameters for items having multiple nonordered responses. Expressions for the estimates of the asymptotic variances of the item parameters and an overall chi-square goodness-of-fit test are also provided.

16.
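This paper introduces the CVLL method, which performs well in IRT, into the field of cognitive diagnosis, and compares and analyzes the performance of CVLL against the relative test-level fit statistics already used in cognitive diagnosis, in order to give practitioners methodological support and a point of reference when selecting cognitive diagnosis models. Results show that CVLL performs better than the other, traditional relative fit statistics; moreover, when the Q-matrix is misspecified, the statistic can still select the better Q-matrix, indicating that CVLL has good prospects for application in Q-matrix detection.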

17.
In item response theory (IRT), the invariance property states that item parameter estimates are independent of the examinee sample, and examinee ability estimates are independent of the test items. While this property has long been established and understood by the measurement community for IRT models, the same cannot be said for diagnostic classification models (DCMs). DCMs are a newer class of psychometric models that are designed to classify examinees according to levels of categorical latent traits. We examined the invariance property for general DCMs using the log-linear cognitive diagnosis model (LCDM) framework. We conducted a simulation study to examine the degree to which theoretical invariance of LCDM classifications and item parameter estimates can be observed under various sample and test characteristics. Results illustrated that LCDM classifications and item parameter estimates show clear invariance when adequate model-data fit is present. To demonstrate the implications of this important property, we conducted additional analyses to show that using pre-calibrated tests to classify examinees provided consistent classifications across calibration samples with varying mastery profile distributions and across tests with varying difficulties.

18.
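Feasibility of applying IRT to self-report inventory tests   Cited by: 6 (self-citations: 1, citations by others: 5)
An analysis of an emotional competence scale using 5-point Likert-type items showed that the items of every subscale had good model-data fit, and that they displayed both invariance of parameter estimates and an association with CTT parameters. These findings indicate that the assumptions for applying IRT models to Likert scales were satisfied, that is, that applying IRT is feasible. The study also showed that IRT can estimate measurement precision more accurately.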

19.
Researchers from 13 countries collaborated in constructing a psychometric scale to measure career adaptability. Based on four pilot tests, a research version of the proposed scale consisting of 55 items was field tested in 13 countries. The resulting Career Adapt-Abilities Scale (CAAS) consists of four scales, each with six items. The four scales measure concern, control, curiosity, and confidence as psychosocial resources for managing occupational transitions, developmental tasks, and work traumas. The CAAS demonstrated metric invariance across all the countries, but did not exhibit residual/strict invariance or scalar invariance. The reliabilities of the CAAS subscales and the combined adaptability scale range from acceptable to excellent when computed with the combined data. As expected, the reliability estimates varied across countries. Nevertheless, the internal consistency estimates for the four subscales of concern, control, curiosity, and confidence were generally acceptable to excellent. The internal consistency estimates for the CAAS total score were excellent across all countries. Separate articles in this special issue report the psychometric characteristics of the CAAS, including initial validity evidence, for each of the 13 countries that collaborated in constructing the Scale.

20.
Person-fit statistics have been proposed to investigate the fit of an item score pattern to an item response theory (IRT) model. The author investigated how these statistics can be used to detect different types of misfit. Intelligence test data were analyzed using person-fit statistics in the context of the G. Rasch (1960) model and R. J. Mokken's (1971, 1997) IRT models. The effect of the choice of an IRT model to detect misfitting item score patterns and the usefulness of person-fit statistics for diagnosis of misfit are discussed. Results showed that different types of person-fit statistics can be used to detect different kinds of person misfit. Parametric person-fit statistics had more power than nonparametric person-fit statistics.
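One simple nonparametric person-fit statistic of the kind discussed above is the number of Guttman errors in a response pattern. The sketch below assumes dichotomous items already sorted from easiest to hardest; it is offered as an illustration and is not necessarily one of the statistics used in the study. The function name `guttman_errors` is hypothetical.

```python
def guttman_errors(pattern):
    """Count Guttman errors in a 0/1 pattern ordered from easiest to hardest item.

    A Guttman error is any item pair in which the easier item is failed (0)
    while the harder item is passed (1).
    """
    errors = 0
    zeros_seen = 0  # failures on easier items encountered so far
    for x in pattern:
        if x == 0:
            zeros_seen += 1
        else:
            # A pass on this item pairs with every earlier failure.
            errors += zeros_seen
    return errors
```

A perfectly monotone pattern such as 1, 1, 1, 0, 0 has no errors, whereas a fully reversed pattern such as 0, 0, 1, 1 has the maximum possible count for its score.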

