Similar literature
1.
A hybrid procedure for number correct scoring is proposed. The proposed scoring procedure is based on both classical true-score theory (CTT) and multidimensional item response theory (MIRT). Specifically, the hybrid scoring procedure uses test item weights based on MIRT and the total test scores are computed based on CTT. Thus, what makes the hybrid scoring method attractive is that this method accounts for the dimensionality of the test items while test scores remain easy to compute. Further, the hybrid scoring does not require large sample sizes once the item parameters are known. Monte Carlo techniques were used to compare and contrast the proposed hybrid scoring method with three other scoring procedures. Results indicated that all scoring methods in this study generated estimated and true scores that were highly correlated. However, the hybrid scoring procedure had significantly smaller error variances between the estimated and true scores relative to the other procedures.  相似文献   

2.
The Gleser-DuBois conditions for selecting from a number of test items those which will maximize the correlation between total test score and criterion will degenerate into expressions requiring only item counts on total distributions and the upper halves of distributions. A grouping convention for scores near medians is recommended. The inefficiency of the method is easily compensated for, because, regardless of the size of the sample, only standard test-scoring equipment and brief computations are required. A procedure is outlined, and some applications are discussed.  相似文献   

3.
Finite sample inference procedures are considered for analyzing the observed scores on a multiple choice test with several items, where, for example, the items are dissimilar, or the item responses are correlated. A discrete p-parameter exponential family model leads to a generalized linear model framework and, in a special case, a convenient regression of true score upon observed score. Techniques based upon the likelihood function, Akaike's information criterion (AIC), an approximate Bayesian marginalization procedure based on conditional maximization (BCM), and simulations for exact posterior densities (importance sampling) are used to facilitate finite sample investigations of the average true score, individual true scores, and various probabilities of interest. A simulation study suggests that, when the examinees come from two different populations, the exponential family can adequately generalize Duncan's beta-binomial model. Extensions to regression models, the classical test theory model, and empirical Bayes estimation problems are mentioned. The Duncan, Keats, and Matsumura data sets are used to illustrate potential advantages and flexibility of the exponential family model, and the BCM technique. The authors wish to thank Ella Mae Matsumura for her data set and helpful comments, Frank Baker for his advice on item response theory, Hirotugu Akaike and Taskin Atilgan for helpful discussions regarding AIC, Graham Wood for his advice concerning the class of all binomial mixture models, Yiu Ming Chiu for providing useful references and information on tetrachoric models, and the Editor and two referees for suggesting several references and alternative approaches.

4.
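Liu Yue, Liu Hongyun. Acta Psychologica Sinica (心理学报), 2017, (9): 1234-1246
The bifactor model can include one global (general) factor and several local (specific) factors at the same time, which gives it a distinct advantage in describing the structure of multidimensional tests, and it has been applied increasingly widely in recent years. Based on the bifactor model, this article proposes four methods for computing composite total scores and dimension scores: the raw score method, the summation method, the global item weighted summation method, and the local item weighted summation method. A simulation study examined how these methods performed relative to the traditional multidimensional IRT method under varying sample sizes, test lengths, and correlations between dimensions. Finally, the results were verified in an empirical study. The results showed that: (1) the total scores and dimension scores produced by the global and local weighted summation methods, especially the local weighted summation method, were closest to the true values and had the highest reliability; (2) when correlations between dimensions were high and the test was long, the local weighted summation method performed well, in some conditions even better than the multidimensional IRT method; (3) only the dimension scores produced by the local weighted summation method reflected the true correlations between dimensions.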

5.
The construct validity of the short form of the Bruininks-Oseretsky Test of Motor Proficiency for the assessment of gross and fine motor skills was assessed in 377 nondisabled Greek preschool and primary school children (age range 5 yr. to 8:3 mo.) from urban areas of northern Greece. Analysis showed that three factors accounted for 54.1% of the total score variance, agreeing with earlier findings. Moreover, the item scores had statistically significant relationships with the total short-form score, except for that of copying a circle with the preferred hand. This latter item was also the only one with a small effect size. Age had a statistically significant effect on the scores of half of the items of the test battery, which also agrees with an earlier finding. The test appeared to be a valid measure of motor proficiency in normal Greek preschool and primary school children.

6.
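Liu Yue, Liu Hongyun. Journal of Psychological Science (心理科学), 2015, (6): 1504-1512
This research explored how to equate test scores when no anchor items are available, using the constructed anchor test method. Study 1 and Study 2 examined the effects of errors in item difficulty ordering and of differences in anchor item difficulty, respectively, on the constructed anchor test method. The results showed that: (1) with equivalent groups, the equating error of the constructed anchor test method gradually increased as the proportion of erroneous anchor items, the degree of error in difficulty ordering, and the difference in anchor item difficulty increased, whereas the equating error of the random equivalent groups method remained relatively stable; with nonequivalent groups, the equating error of the constructed anchor test method was always smaller than that of the random equivalent groups method; (2) for the constructed anchor test method with nonequivalent groups, the shorter the anchor test, the larger the equating error.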

7.
Yao Lili, Haberman Shelby J., Zhang Mo. Psychometrika, 2019, 84(1): 186-211
In best linear prediction (BLP), a true test score is predicted by observed item scores and by ancillary test data. If the use of BLP rather than a more direct estimate of a true...

8.
Babitz Milton, Keys Noel. Psychometrika, 1940, 5(4): 283-288
It is noted that the average inter-item correlation, which represents the internal consistency of a test, yields a unique estimate of test reliability. A close approximation to this average is given by a formula which requires the correlation of each item with the total score and the standard deviation of each item. The formula is especially useful in those instances where the number of items is small and where the variation in item sigmas should not be neglected.  相似文献   
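
The link between average inter-item correlation and test reliability can be illustrated with the standard Spearman-Brown relation. This is a minimal sketch, not the approximation formula the abstract refers to (which is not given here); the function name `reliability_from_mean_r` and its parameters are hypothetical and chosen only for illustration.

```python
def reliability_from_mean_r(k, mean_r):
    """Spearman-Brown reliability of a k-item test.

    k      -- number of items (assumed >= 1)
    mean_r -- average inter-item correlation (assumed already computed)

    Uses the standard relation r_kk = k * r / (1 + (k - 1) * r).
    This is the textbook formula, not necessarily the one in the paper.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * mean_r / (1.0 + (k - 1) * mean_r)


if __name__ == "__main__":
    # Ten items with an average inter-item correlation of 0.2
    print(round(reliability_from_mean_r(10, 0.2), 4))
```

With a single item the function returns the inter-item correlation itself, and the estimate rises toward 1 as items are added at a fixed average correlation.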

9.
The current study investigated the impact of requiring respondents to elaborate on their answers to a biodata measure on mean scores, the validity of the biodata item composites, subgroup mean differences, and correlations with social desirability. Results of this study indicate that elaborated responses result in scores that are much lower than nonelaborated responses to the same items by an independent sample. Despite the lower mean score on elaborated items, it does not appear that elaboration affects the size of the correlation between social desirability and responses to biodata items or that it affects criterion-related validity or subgroup mean differences in a practically significant way.  相似文献   

10.
Researchers often include a social desirability measure in personality measures, commonly the Balanced Inventory of Desirable Responding (BIDR), and correlate it with personality items to probe for social desirability of the items. A strong correlation between BIDR scores and a personality item would indicate high item social desirability. The current research assesses the validity of this practice. Results showed that these correlations have high validity only when BIDR scores are calculated as a continuous variable rather than as dichotomized item scores. In addition, self-deception scores have higher validity for detecting item social desirability than do impression management scores. The current research supported the use of the self-deception scores, in particular, to detect highly desirable or undesirable items.  相似文献   

11.
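Wang Wenyi, Song Lihong, Ding Shuliang. Acta Psychologica Sinica (心理学报), 2016, 48(12): 1612-1624
This article introduces classification accuracy and classification consistency indices under multidimensional item response theory models, uses the Monte Carlo method to compute the indices under complex decision rules, and proves mathematically that two estimators of the classification accuracy index converge in probability to the same true value under a uniform prior and identical decision rules. The results show that: the classification accuracy index evaluates the accuracy of classification results fairly accurately; the classification consistency index evaluates the test-retest consistency of classification results well; under certain conditions, indices based on the ability scale outperform indices based on the raw total score; estimation precision remains good even as the number of test dimensions increases; and classification accuracy and classification consistency are higher as test length and the correlation between dimensions increase. The indices can be used to evaluate classification reliability and validity under a variety of decision rules for criterion-referenced tests or computerized classification tests.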

12.
Intelligence differences might contribute to true differences in personality traits. It is also possible that intelligence might contribute to differences in understanding and interpreting personality items. Previous studies have not distinguished clearly between these possibilities. Before it can be accepted that scale score differences actually reflect personality differences, personality items should show measurement invariance. The authors used item response theory to test measurement invariance in the five-factor model scales of the International Personality Item Pool (IPIP) and NEO-Five-Factor Inventory (NEO-FFI) across two groups of participants from the Lothian Birth Cohort 1936 with relatively low and high cognitive abilities. Each group consisted of 320 individuals, with equal numbers of men and women. The mean IQ difference of the groups was 21 points. It was found that the IPIP and NEO-FFI items were measurement invariant across all five scales, making it possible to conclude that any differences in IPIP and NEO-FFI scores between people with low and high cognitive abilities reflected personality trait differences.  相似文献   

13.
This paper argues that test data are ordinal, that latent trait scores are only determined ordinally, and that test data are used largely for ordinal purposes. Therefore it is desirable to develop a test theory based only on ordinal assumptions. A set of ordinal assumptions is presented, including an ordinal version of local independence. From these assumptions it is first shown that the gamma-correlation between two tests is the product of their gamma-correlations with the true latent order. The theory is generalized to allow for heterogeneous tests by defining a weighted average local independence. The tau-correlations between total score and the latent order can be found in both homogeneous and heterogeneous cases, and a system of differential item weighting to maximize the tau-correlation between weighted items and the latent order is provided. Thus a purely ordinal test theory seems possible. Part of this work was done while the author was a Visiting Fellow at Macquarie University. The paper has benefitted from discussions with Professors Thomas J. Reynolds and Roderick P. McDonald and from the comments of several anonymous reviewers.
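
For readers unfamiliar with ordinal correlations, the sketch below computes a gamma coefficient from concordant and discordant pairs. It assumes that the abstract's "gamma-correlation" corresponds to the Goodman-Kruskal gamma, which the abstract does not state explicitly; the function name is hypothetical.

```python
def goodman_kruskal_gamma(x, y):
    """Goodman-Kruskal gamma: (C - D) / (C + D).

    C counts pairs ordered the same way on x and y, D counts pairs
    ordered in opposite ways. Pairs tied on either variable are ignored.
    Assumption: this is the ordinal measure meant by "gamma-correlation".
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = 0
    discordant = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


if __name__ == "__main__":
    test_a = [1, 2, 3, 4]
    test_b = [1, 3, 2, 4]
    print(goodman_kruskal_gamma(test_a, test_b))
```

Only the ordering of the scores enters the calculation, which is the property an ordinal test theory relies on.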

14.
Does the Movement Assessment Battery for Children (M-ABC) measure what it claims to measure? The concurrent validity of the total impairment score and some of the item scores of the second and third age bands of the M-ABC test was investigated. One hundred thirty-three children between 7 and 9 years old were assessed with the M-ABC test, a ball catching test, and two tasks measuring dynamic balance. Ninety of these children were identified as having poor ball catching skill and 43 were typically developing children. One hundred and seven children were assessed with the second age band of the M-ABC (the 7- and 8-year-old children) and 26 with the third age band (the 9-year-old children). The correlations between the ball catching test, the two dynamic balance tasks, and the corresponding items of the M-ABC varied from non-significant to a highly significant correlation coefficient of -0.74. For some items concurrent validity was established, but other items seemed less valid, probably due to a lack of discriminative power. The concurrent validity of the total impairment score of the M-ABC was confirmed for the second age band. Correlation coefficients between the ball catching test, the dynamic balance skills, and the M-ABC varied between -0.72 and -0.76. The results for the third age band have to be interpreted with prudence because they were based on only 26 children.

15.
This paper examines psychometric properties of scores derived from calibration curves (overconfidence, calibration, resolution, and slope) and an analogue of overconfidence that is based on a posttest estimate of the proportion of correctly solved items. Four tests from the theory of fluid and crystallized intelligence were used, and two of these tests employed both sequential and simultaneous methods of item presentation. The results indicate that the overconfidence score not only has the highest reliability, but is the only score with a reliability normally considered adequate for use in individual differences research. There is some, albeit weak, difference in subjects' level of overconfidence between sequential and simultaneous methods of item presentation. Correlational evidence confirms our previous findings that overconfidence scores from perceptual and ‘knowledge’ tasks define the same factor. In agreement with the results of Gigerenzer, Hoffrage and Kleinbolting (1991), subjects' post-test estimates of their performance showed lower levels of overconfidence than did the traditional measures based on subjects' confidence judgment responses to individual items. Also, after controlling for the actual test performances, the post-test performance estimates and average confidence ratings were only slightly positively correlated, suggesting that different psychological processes may underlie these two measures. Finally, our results suggest that average confidence over all items in the test may be a more useful measure in individual differences research than scores derived from calibration curves.  相似文献   
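
The overconfidence score discussed above is commonly computed as mean confidence minus proportion correct. The sketch below assumes that common definition and assumes confidence ratings expressed as proportions between 0 and 1; the abstract does not spell out its own formula, and the function name is hypothetical.

```python
def overconfidence(confidences, correct):
    """Overconfidence (bias) score: mean confidence minus proportion correct.

    confidences -- per-item confidence ratings as proportions in [0, 1]
    correct     -- per-item booleans, True if the item was answered correctly

    Positive values indicate overconfidence, negative values underconfidence.
    Assumption: this is the usual bias score, not a formula quoted from the paper.
    """
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("need equally long, non-empty sequences")
    mean_confidence = sum(confidences) / len(confidences)
    proportion_correct = sum(1 for c in correct if c) / len(correct)
    return mean_confidence - proportion_correct


if __name__ == "__main__":
    ratings = [0.9, 0.8, 0.7, 0.6]
    answers = [True, False, True, False]
    print(round(overconfidence(ratings, answers), 2))
```

The posttest analogue mentioned in the abstract would replace the mean of the item-level ratings with a single after-the-fact estimate of the proportion of items solved correctly.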

16.
Cognitive diagnosis models of educational test performance rely on a binary Q‐matrix that specifies the associations between individual test items and the cognitive attributes (skills) required to answer those items correctly. Current methods for fitting cognitive diagnosis models to educational test data and assigning examinees to proficiency classes are based on parametric estimation methods such as expectation maximization (EM) and Markov chain Monte Carlo (MCMC) that frequently encounter difficulties in practical applications. In response to these difficulties, non‐parametric classification techniques (cluster analysis) have been proposed as heuristic alternatives to parametric procedures. These non‐parametric classification techniques first aggregate each examinee's test item scores into a profile of attribute sum scores, which then serve as the basis for clustering examinees into proficiency classes. Like the parametric procedures, the non‐parametric classification techniques require that the Q‐matrix underlying a given test be known. Unfortunately, in practice, the Q‐matrix for most tests is not known and must be estimated to specify the associations between items and attributes, risking a misspecified Q‐matrix that may then result in the incorrect classification of examinees. This paper demonstrates that clustering examinees into proficiency classes based on their item scores rather than on their attribute sum‐score profiles does not require knowledge of the Q‐matrix, and results in a more accurate classification of examinees.  相似文献   

17.
Usually, methods for detection of differential item functioning (DIF) compare the functioning of items across manifest groups. However, the manifest groups with respect to which the items function differentially may not necessarily coincide with the true source of the bias. It is expected that DIF detection under a model that includes a latent DIF variable is more sensitive to this source of bias. In a simulation study, it is shown that a mixture item response theory model, which includes a latent grouping variable, performs better in identifying DIF items than DIF detection methods using manifest variables only. The difference between manifest and latent DIF detection increases as the correlation between the manifest variable and the true source of the DIF becomes smaller. Different sample sizes, relative group sizes, and significance levels are studied. Finally, an empirical example demonstrates the detection of heterogeneity in a minority sample using a latent grouping variable. Manifest and latent DIF detection methods are applied to a Vocabulary test of the General Aptitude Test Battery (GATB).  相似文献   

18.
Causal theories of measurement view test items as effects of a common cause. Behavior domain theories view test item responses as behaviors sampled from a common domain. A domain score is a composite score over this domain. The question arises whether latent variables can simultaneously constitute domain scores and common causes of item scores. One argument to the contrary holds that behavior domain theory offers more effective guidance for item construction than a causal theory of measurement. A second argument appeals to the apparent circularity of taking a domain score, which is defined in terms of a domain of behaviors, as a cause of those behaviors. Both arguments require qualification and behavior domain theory seems to rely on implicit causal relationships in two respects. Three strategies permit reconciliation of the two theories: One can take a causal structure as providing the basis for a homogeneous domain. One can construct a homogeneous domain and then investigate whether a causal structure explains the homogeneity. Or, one can take the domain score as linked to an existing attribute constrained by indirect measurement.  相似文献   

19.
In assessments of attitudes, personality, and psychopathology, unidimensional scale scores are commonly obtained from Likert scale items to make inferences about individuals' trait levels. This study approached the issue of how best to combine Likert scale items to estimate test scores from the practitioner's perspective: Does it really matter which method is used to estimate a trait? Analyses of 3 data sets indicated that commonly used methods could be classified into 2 groups: methods that explicitly take account of the ordered categorical item distributions (i.e., partial credit and graded response models of item response theory, factor analysis using an asymptotically distribution-free estimator) and methods that do not distinguish Likert-type items from continuously distributed items (i.e., total score, principal component analysis, maximum-likelihood factor analysis). Differences in trait estimates were found to be trivial within each group. Yet the results suggested that inferences about individuals' trait levels differ considerably between the 2 groups. One should therefore choose a method that explicitly takes account of item distributions in estimating unidimensional traits from ordered categorical response formats. Consequences of violating distributional assumptions were discussed.  相似文献   

20.
The relation between item difficulty distributions and the validity and reliability of tests is computed through use of normal correlation surfaces for varying numbers of items and varying degrees of item intercorrelations. Optimal or near optimal item difficulty distributions are thus identified for various possible item difficulty distributions. The results indicate that, if a test is of conventional length, is homogeneous as to content, and has a symmetrical distribution of item difficulties, correlation with a normally distributed perfect measure of the attribute common to the items does not vary appreciably with variation in the item difficulty distribution. Greater variation was evident in correlation with a second duplicate test (reliability). The general implications of these findings and their particular significance for evaluating techniques aimed at increasing reliability are considered.  相似文献   

