1.

E. E. Cureton 《Psychometrika》1966,31(1):93-96

Previous papers on this subject derive the correlation between an item and the remainder of the test. This correlation is unsatisfactory because the reliability of the remainder varies inversely with the reliability of the item omitted. The present paper derives the correlation between an item and the total test, with that item replaced by a rationally equivalent item. The general formula is then modified, for dichotomous items, to give the corrected point-biserial, biserial, and Brogden biserial correlations. The results apply strictly only to factorially homogeneous tests: those in which the same trait or combination of traits is measured (apart from error) by every item.

2.

Methods of item validation and abacs for item-test correlation and critical ratio of upper-lower difference

Charles I. Mosier, John V. McQuitty 《Psychometrika》1940,5(1):57-65

It is shown that by making the assumption that the knowledge of the test-item and the knowledge of the entire test are both distributed normally, the correlation coefficient between any item and the entire test can be expressed as a function solely of two proportions: the percentage of a high-scoring group passing the item and the percentage of a low-scoring group passing the item. This function is expressed graphically as a family of curves for each of two conditions: where the high-scoring and low-scoring groups are samples of the highest and the lowest quarters respectively, and where they are samples from the upper and lower halves. It is shown, moreover, that two other common measures of item validity, the upper-lower difference and the critical ratio of the upper-lower difference, may be drawn on the same coordinate axes.

3.

A graphical method for the rapid calculation of biserial and point biserial correlation in test research

Goheen HW, Davidoff MD 《Psychometrika》1951,16(2):239-242

4.

Estimating the reliability of a test split into two parts of equal or unequal length

Feldt LS, Charter RA 《Psychological Methods》2003,8(1):102-109

When the reliability of test scores must be estimated by an internal consistency method, partition of the test into just 2 parts may be the only way to maintain content equivalence of the parts. If the parts are classically parallel, the Spearman-Brown formula may be validly used to estimate the reliability of total scores. If the parts differ in their standard deviations but are tau equivalent, Cronbach's alpha is appropriate. However, if the 2 parts are congeneric, that is, they are unequal in functional length or they comprise heterogeneous item types, a less well-known estimate, the Angoff-Feldt coefficient, is appropriate. Guidelines in terms of the ratio of standard deviations are proposed for choosing among Spearman-Brown, alpha, and Angoff-Feldt coefficients.
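
The three coefficients named in this abstract can be compared on the same two part scores. The sketch below uses the textbook forms of the Spearman-Brown, two-part alpha, and Angoff-Feldt coefficients; the function name and the sample data are illustrative assumptions, and the paper's guidelines on standard-deviation ratios are not reproduced here.

```python
from math import sqrt


def _mean(xs):
    return sum(xs) / len(xs)


def _cov(xs, ys):
    """Sample covariance with divisor n - 1 (variance when xs is ys)."""
    mx, my = _mean(xs), _mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def two_part_coefficients(part1, part2):
    """Return Spearman-Brown, two-part alpha, and Angoff-Feldt estimates.

    part1 and part2 hold each examinee's score on the two test parts.
    """
    v1, v2 = _cov(part1, part1), _cov(part2, part2)
    c12 = _cov(part1, part2)
    vx = v1 + v2 + 2 * c12          # variance of the total score
    r12 = c12 / sqrt(v1 * v2)       # correlation between the parts

    spearman_brown = 2 * r12 / (1 + r12)
    alpha = 4 * c12 / vx            # two-part coefficient alpha
    angoff_feldt = 4 * c12 / (vx - (v1 - v2) ** 2 / vx)
    return {
        "spearman_brown": spearman_brown,
        "alpha": alpha,
        "angoff_feldt": angoff_feldt,
    }


# Hypothetical part scores for six examinees; the second part is shorter.
long_part = [10, 12, 9, 15, 11, 14]
short_part = [5, 7, 4, 8, 6, 6]
print(two_part_coefficients(long_part, short_part))
```

When the two parts have equal variances the three estimates coincide; they separate as the part variances diverge, which is the situation the abstract's guidelines address.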

5.

The relation of item difficulty and inter-item correlation to test variance and reliability (total citations: 1; self-citations: 0; citations by others: 1)

Harold Gulliksen 《Psychometrika》1945,10(2):79-91

Under assumptions that will hold for the usual test situation, it is proved that test reliability and variance increase (a) as the average inter-item correlation increases, and (b) as the variance of the item difficulty distribution decreases. As the average item variance increases, the test variance will increase, but the test reliability will not be affected. (It is noted that as the average item variance increases, the average item difficulty approaches .50). In this development, no account is taken of the effect of chance success, or the possible effect on student attitude of different item difficulty distributions. In order to maximize the reliability and variance of a test, the items should have high intercorrelations, all items should be of the same difficulty level, and the level should be as near to 50% as possible. The desirability of determining this relationship has been indicated by previous writers. Work on the present paper arose out of some problems raised by Dr. Herbert S. Conrad in connection with an analysis of aptitude tests. On leave for Government war research from the Psychology Department, University of Chicago.
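
A small numerical sketch can make two of these claims concrete. It assumes a deliberately simplified case that is not Gulliksen's general derivation: k items that all share one variance and one inter-item correlation, with coefficient alpha standing in for reliability. The function names are hypothetical.

```python
def composite_variance(k, item_var, r):
    """Variance of a sum of k items with common variance and correlation r."""
    return k * item_var * (1 + (k - 1) * r)


def composite_reliability(k, item_var, r):
    """Coefficient alpha for the same equal-variance, equal-correlation case."""
    total_var = composite_variance(k, item_var, r)
    return k / (k - 1) * (1 - k * item_var / total_var)


k = 20
for p in (0.2, 0.5):                 # proportion passing each item
    item_var = p * (1 - p)           # variance of a dichotomous item
    print(p,
          composite_variance(k, item_var, r=0.2),
          composite_reliability(k, item_var, r=0.2))
```

In this special case the item variance cancels out of the reliability, so moving item difficulty toward .50 raises the test variance while the reliability depends only on k and the inter-item correlation.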

6.

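A comparison of item selection strategies for polytomously scored items in MCAT under the bifactor model

Mao Xiuzhen, Liu Huan, Tang Qian 《Journal of Psychological Science (心理科学)》2019,(1):187-193

The bifactor model assumes that a test measures one general factor and several group factors, which matches the factor structure of many educational and psychological tests. The "dimension reduction" method, which simplifies the multidimensional integrals in parameter estimation into a series of iterated two-dimensional integrals, is an important feature of the bifactor model. For computerized adaptive tests with polytomously scored items, this paper first derives the Fisher information under the bifactor graded response model, then derives how the dimension reduction method applies to item selection, and finally compares the D-optimality method, the posterior-weighted Fisher information D-optimality method (PDO), the posterior-weighted Kullback-Leibler method (PKL), the continuous entropy method (CEM), and the mutual information (MI) method in item banks with low, medium, and high bifactor patterns, in terms of the correlation, root mean square error, absolute bias, and Euclidean distance of the ability estimates. The simulation study shows that: (1) the stronger the bifactor pattern, that is, the smaller the difference between the discriminations of the general factor and the group factors on the items, the lower the estimation precision for the general factor, the higher the estimation precision for the group factors, and the higher the estimation precision for overall ability; (2) under the same experimental conditions, the continuous entropy method has the highest measurement precision, the PKL method has the lowest ability estimation precision, and the other methods do not differ significantly in measurement precision.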

7.

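An analysis of the effect of item bank structure on CAT under polytomous scoring models

Cheng Xiaoyang, Ding Shuliang, Wu Huafang, Zhu Longyin 《Psychological Exploration (心理学探新)》2014,34(5):452-456

Under polytomous scoring models, an item has several difficulty or step parameters. When items are selected under such models, the multiple difficulty parameters of an item are usually summarized by a single composite index. Once each item's difficulty parameters have been effectively combined, the distribution of the composite difficulty parameter changes; if a suitable number of items with higher or lower average difficulty are then added to the item bank, both test precision and item exposure rates improve to some extent.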

8.

Test reliability estimated by analysis of variance

Cyril Hoyt 《Psychometrika》1941,6(3):153-160

A formula for estimating the reliability of a test, based on the analysis of variance theory, is developed and illustrated. The data needed for the required computation are the number of correct responses to each item and the score for each subject. The results obtained from this formula are identical with those from one of the special cases of the Kuder-Richardson formulation. The relationships of the new procedure to other approaches to the problem are indicated.
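
The equivalence described above can be checked numerically. The sketch below computes an analysis-of-variance reliability from a persons-by-items table (person mean square versus residual mean square) and compares it with the Kuder-Richardson case usually labeled formula 20; that identification, the function names, and the toy data are assumptions made for illustration and are not taken from the abstract.

```python
def hoyt_reliability(scores):
    """ANOVA-based reliability: (MS_persons - MS_residual) / MS_persons."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    person_means = [sum(row) / k for row in scores]
    item_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_persons = k * sum((m - grand) ** 2 for m in person_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ss_residual = ss_total - ss_persons - ss_items

    ms_persons = ss_persons / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return (ms_persons - ms_residual) / ms_persons


def kr20(scores):
    """Kuder-Richardson formula 20 for 0/1 item scores (divisor n throughout)."""
    n, k = len(scores), len(scores[0])
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    item_means = [sum(row[j] for row in scores) / n for j in range(k)]
    sum_pq = sum(p * (1 - p) for p in item_means)
    return k / (k - 1) * (1 - sum_pq / var_total)


# Hypothetical responses: 5 examinees (rows) by 4 dichotomous items (columns).
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(hoyt_reliability(data), kr20(data))
```

The two functions need only the item totals and the person totals, which matches the data requirements stated in the abstract.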

9.

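Design of an adaptive grouped cognitive diagnostic test and its item selection strategies

Luo Fen, Wang Xiaoqing, Ding Shuliang, Xiong Jianhua 《Journal of Psychological Science (心理科学)》2018,(3):720-726

Applying the OMST online assembly mode, an adaptive grouped cognitive diagnostic test (CD-AMGT) is proposed. Because the prerequisite relation among knowledge states is a partial order and forms a lattice, the supremum and infimum in the lattice of the current knowledge state estimate are used to bound the possible range of the examinee's true knowledge state, and the next group of items is assembled accordingly. Within each group, items are selected with the PWKL or SHE strategy in order to balance diagnostic accuracy, efficiency, and security. Simulation experiments show that, compared with PWKL and SHE, when item types are rich, CD-AMGT shows considerable advantages in the uniformity of item bank usage and in computation time, at the cost of a slight decrease in classification accuracy.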

10.

Sufficiency and Conditional Estimation of Person Parameters in the Polytomous Rasch Model

David Andrich 《Psychometrika》2010,75(2):292-308

Rasch models are characterised by sufficient statistics for all parameters. In the Rasch unidimensional model for two ordered categories, the parameterisation of the person and item is symmetrical and it is readily established that the total scores of a person and item are sufficient statistics for their respective parameters. In contrast, in the unidimensional polytomous Rasch model for more than two ordered categories, the parameterisation is not symmetrical. Specifically, each item has a vector of item parameters, one for each category, and each person only one person parameter. In addition, different items can have different numbers of categories and, therefore, different numbers of parameters. The sufficient statistic for the parameters of an item is itself a vector. In estimating the person parameters in presently available software, these sufficient statistics are not used to condition out the item parameters. This paper derives a conditional, pairwise, pseudo-likelihood and constructs estimates of the parameters of any number of persons which are independent of all item parameters and of the maximum scores of all items. It also shows that these estimates are consistent. Although Rasch’s original work began with equating tests using test scores, and not with items of a test, the polytomous Rasch model has not been applied in this way. Operationally, this is because the current approaches, in which item parameters are estimated first, cannot handle test data where there may be many scores with zero frequencies. A small simulation study shows that, when using the estimation equations derived in this paper, such a property of the data is no impediment to the application of the model at the level of tests. This opens up the possibility of using the polytomous Rasch model directly in equating test scores.
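
The sufficiency property for the two-category case can be illustrated with two items. The sketch below is not the paper's pairwise pseudo-likelihood; it only shows, for the dichotomous Rasch model, that once a person's total score is fixed at 1 the probability of each response pattern no longer depends on the person parameter. The function names and parameter values are illustrative assumptions.

```python
from math import exp


def rasch_p(theta, b):
    """Probability of a correct response in the dichotomous Rasch model."""
    return 1 / (1 + exp(-(theta - b)))


def p_first_correct_given_total_one(theta, b1, b2):
    """P(item 1 correct, item 2 wrong | exactly one of the two is correct)."""
    p1, p2 = rasch_p(theta, b1), rasch_p(theta, b2)
    p10 = p1 * (1 - p2)   # pattern (1, 0)
    p01 = (1 - p1) * p2   # pattern (0, 1)
    return p10 / (p10 + p01)


b1, b2 = 0.3, 1.1         # hypothetical item difficulties
for theta in (-1.0, 0.0, 2.0):
    print(theta, p_first_correct_given_total_one(theta, b1, b2))

# Closed form that the conditional probability reduces to: no theta appears.
print(exp(-b1) / (exp(-b1) + exp(-b2)))
```

Conditioning on the total score removes the person parameter; the paper applies the same idea in the other direction, conditioning out item parameters to estimate persons.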

11.

Correction of item-total correlations in item analysis

Sten Henrysson 《Psychometrika》1963,28(2):211-218

The biserial correlation between an item and the total test of which the item is a part tends to be misleadingly high when used in item analysis, since the item is included in the total test. Two formulas with correction for this overlap are derived and compared with Zubin's and Guilford's formulas. One of the new coefficients is invariant to test length.
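
The overlap effect is easy to see numerically. The sketch below does not implement either of the corrected coefficients derived in the paper, nor Zubin's or Guilford's formulas; as a stand-in it compares the ordinary Pearson (point-biserial) item-total correlation with the item-remainder correlation, in which the item is simply dropped from the total. The function names and data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; point-biserial when xs is a 0/1 item."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def item_total_and_rest(scores, j):
    """Return (item-total r, item-remainder r) for item column j."""
    item = [row[j] for row in scores]
    total = [sum(row) for row in scores]
    rest = [t - x for t, x in zip(total, item)]  # total without item j
    return pearson(item, total), pearson(item, rest)


# Hypothetical responses: 5 examinees (rows) by 4 dichotomous items (columns).
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
for j in range(4):
    print(j, item_total_and_rest(data, j))
```

The item-total value is never smaller than the item-remainder value, and the gap is largest on short tests, where a single item makes up a large share of the total.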

12.

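Reliability and validity of the Meaning in Life Questionnaire (revised) among junior high school students: a comparative analysis of left-behind and non-left-behind students

Chen Wei, He Feixia, Huang Rong, Zhao Shouying 《Psychological Exploration (心理学探新)》2017,(3):247-253

This study examined the reliability and validity of the Meaning in Life Questionnaire (revised) among junior high school students and compared left-behind and non-left-behind students on psychometric indices. A total of 1300 junior high school students, 636 of whom were left-behind students, were surveyed with the Meaning in Life Questionnaire (revised), a transcendent meaning scale, an emotion regulation scale, the Rosenberg Self-Esteem Scale, and an index of well-being scale. Exploratory factor analysis, parallel analysis, and minimum average partial analysis all indicated a two-factor structure, and the confirmatory factor analysis model fit each group well. The questionnaire correlated significantly and positively with all of the above criterion variables. Across gender and left-behind status, a few items showed uniform or non-uniform differential item functioning. The δ coefficients of the total scale and of the search-for-meaning and presence-of-meaning subscales were all greater than 0.9. The Meaning in Life Questionnaire (revised) has good reliability and validity among both junior high school students in general and left-behind junior high school students; the differential item functioning across gender and left-behind status can be ignored; and the questionnaire discriminates well among respondents.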

13.

A subset selection technique for scoring items on a multiple choice test

Jean D. Gibbons, Ingram Olkin, Milton Sobel 《Psychometrika》1979,44(3):259-270

On a multiple-choice test in which each item has k alternative responses, the test taker is permitted to choose any subset which he believes contains the one correct answer. A scoring system is devised that depends on the size of the subset and on whether or not the correct answer is eliminated. The mean and variance of the score per item are obtained. Methods are derived for determining the total number of items that should be included on the test so that the average score on all items can be regarded as a good measure of the subject's knowledge. Efficiency comparisons between conventional and the subset selection scoring procedures are made. The analogous problem of r > 1 correct answers for each item (with r fixed and known) is also considered. The authors are grateful to M. Aitkin, C. Coombs, F. Lord, and the reviewers for their comments and suggestions.

14.

Sequential Computerized Mastery Tests—Three Simulation Studies

《International Journal of Testing》2013,13(1):41-55

A simulation study of a sequential computerized mastery test is carried out with items modeled with the 3 parameter logistic item response theory model. The examinees' responses are either identically distributed, not identically distributed, or not identically distributed together with estimation errors in the item characteristics. The simulations indicated that the observed results from the operating characteristic function differ significantly from the theoretical results, which is probably due to the use of an approximation formula. The mean number of items in a test, the distribution of test length, and the variance depend highly on how well we know the true values of the item characteristics and whether they are identically distributed or not.
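
For reference, the item model named in this abstract can be written in a few lines. The sketch below gives the standard 3 parameter logistic response function and a simulated response; it omits the optional scaling constant (often written D = 1.7), which is an assumption, and it does not reproduce the study's sequential stopping rule. Function and parameter names are illustrative.

```python
import random
from math import exp


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    """
    return c + (1 - c) / (1 + exp(-a * (theta - b)))


def simulate_response(theta, a, b, c, rng=random):
    """Draw one 0/1 response for an examinee with ability theta."""
    return 1 if rng.random() < p_3pl(theta, a, b, c) else 0


# Hypothetical item and a range of abilities.
for theta in (-2.0, 0.5, 3.0):
    print(theta, p_3pl(theta, a=1.2, b=0.5, c=0.2))
```

When ability equals the item difficulty the probability sits halfway between the guessing level and 1, and for very low ability it approaches the guessing level c.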

15.

Using a Response Time–Based Expected A Posteriori Estimator to Control for Differential Speededness in Computerized Adaptive Test

Justin L. Kern, Edison Choe 《Applied Psychological Measurement》2021,45(5):361

This study investigates using response times (RTs) with item responses in a computerized adaptive test (CAT) setting to enhance item selection and ability estimation and control for differential speededness. Using van der Linden’s hierarchical framework, an extended procedure for joint estimation of ability and speed parameters for use in CAT is developed following van der Linden; this is called the joint expected a posteriori estimator (J-EAP). It is shown that the J-EAP estimate of ability and speededness outperforms the standard maximum likelihood estimator (MLE) of ability and speededness in terms of correlation, root mean square error, and bias. It is further shown that under the maximum information per time unit item selection method (MICT), a method which uses estimates for ability and speededness directly, using the J-EAP further reduces average examinee time spent and variability in test times between examinees, beyond the gains this selection algorithm achieves with the MLE, while maintaining estimation efficiency. Simulated test results are further corroborated with test parameters derived from a real data example.

16.

The whole is indeed more than the sum of its parts: perceptual averaging in the absence of individual item representation

Corbett JE, Oriet C 《Acta psychologica》2011,138(2):289-301

We tested Ariely's (2001) proposal that the visual system represents the overall statistical properties of sets of objects against alternative accounts of rapid averaging involving sub-sampling strategies. In four experiments, observers could rapidly extract the mean size of a set of circles presented in an RSVP sequence, but could not reliably identify individual members. Experiment 1 contrasted performance on a member identification task with performance on a mean judgment task, and showed that the tasks could be dissociated based on whether the test probe was presented before or after the sequence, suggesting that member identification and mean judgment are subserved by different mechanisms. In Experiment 2, we confirmed that when given a choice between a probe corresponding to the mean size of the set and a foil corresponding to the mean of the smallest and largest items only, the former is preferred to the latter, even when observers are explicitly instructed to average only the smallest and largest items. Experiment 3 showed that a test item corresponding to the mean size of the set could be reliably discriminated from a foil but the largest item in the set, differing by an equivalent amount, could not. In Experiment 4, observers rejected test items dissimilar to the mean size of the set in a member identification task, favoring test items that corresponded to the mean of the set over items that were actually shown. These findings suggest that mean representation is accomplished without explicitly encoding individual items.

17.

A simple scoring weight for test items and its reliability

J. P. Guilford 《Psychometrika》1941,6(6):367-374

It is pointed out that the scoring weights for test items should be approximations to regression-equation weights. For this reason any estimate of reliability of the weight should not be permitted to influence the size of the weight but should be used in determining the limit of acceptability of an item. A simple approximation weight is recommended for general use, and an abac is provided for the estimation of it when the correlation between item and criterion is the phi coefficient. A formula for the standard error of this weight is derived and tables of significant and very significant weights are presented in terms of deviations from the median weight.

18.

A method of score conversion through item statistics

Frances Swineford, Chung-Teh Fan 《Psychometrika》1957,22(2):185-188

A method is presented for converting the scores on one form of a test to those on another form of the same test. The method is particularly applicable to the case where each form has been administered to a different group and the only link between the two forms is a subset of items common to both. The proposed method, called the item method of conversion, has been applied to several tests for which other methods of conversion are available for comparison. The necessary data are limited to tests for which the total score is the criterion for item analyses. The method gives highly satisfactory results for all the tests to which it has been applied, particularly when the two groups are rather different, in which case the delta method (a different item method) is inappropriate. The authors are only two of a group, including W. H. Angoff, F. M. Lord, and M. K. Schultz, all of whom have made important contributions to this paper.

19.

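Item discrimination indices for cognitive diagnostic tests from a classification perspective and their application

Wang Wenyi, Song Lihong, Ding Shuliang 《Journal of Psychological Science (心理科学)》2018,(2):453-458

In cognitive diagnosis there is as yet no index that can directly evaluate an item's attribute classification accuracy in the absence of response data. Item-level attribute classification accuracy depends on the item's attribute vector, the item parameters, the prior distribution, and the responses. Combining these factors, an item-level expected attribute classification accuracy index is defined and used for test assembly. Simulation studies show that the new index evaluates item attribute classification accuracy very accurately and is very important for item screening; with pattern classification accuracy as the evaluation criterion, the test assembly method based on the new index performs comparably to classical test assembly methods.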

20.

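A comparison of item selection strategies for computerized adaptive testing based on the GPCM (total citations: 1; self-citations: 0; citations by others: 1)

Liu Zhen, Ding Shuliang, Lin Haijing 《Acta Psychologica Sinica (心理学报)》2008,40(5):618-625

Item selection strategy is an important topic in research on computerized adaptive testing (CAT); its quality directly affects the reliability, validity, and security of the test. Much CAT research and application is built on the dichotomous (0-1) scoring model, and research on item selection strategies for polytomously scored CAT is rarely reported. Although CAT research based on the GRM has been carried out in China, no CAT research based on the GPCM has yet been reported. Using a computer simulation program, this paper compares four item selection strategies for CAT under the generalized partial credit model (GPCM) in a variety of conditions. The results show that when examinee ability is normally distributed, the effectiveness of an item selection strategy depends strongly on the distribution of the item step parameters: (1) when the item step parameters all follow a normal distribution, the strategy that matches ability to the item step parameters works best; (2) when the item step parameters all follow a uniform distribution, the strategy that matches ability to the mean of the item step parameters works best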
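
The model and one of the matching strategies mentioned above can be sketched briefly. The code below gives the standard GPCM category probabilities and a hypothetical selector that picks the item whose mean step parameter lies closest to the current ability estimate; the abstract does not spell out the paper's exact matching criteria, so the selector, the function names, and the small item bank are assumptions for illustration only.

```python
from math import exp


def gpcm_probs(theta, a, steps):
    """Category probabilities for scores 0..len(steps) under the GPCM.

    a: item discrimination; steps: list of step parameters b_1..b_m.
    """
    # Cumulative sums of a * (theta - b_v); the score-0 term is 0.
    cumulative = [0.0]
    for b in steps:
        cumulative.append(cumulative[-1] + a * (theta - b))
    top = max(cumulative)                        # subtract max for stability
    weights = [exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def closest_mean_step(theta, bank):
    """Hypothetical selector: index of the item whose mean step is nearest theta."""
    def distance(index):
        steps = bank[index]
        return abs(theta - sum(steps) / len(steps))
    return min(range(len(bank)), key=distance)


# Hypothetical bank: each entry is one item's list of step parameters.
bank = [[-1.0, 0.0], [0.4, 0.8], [2.0, 3.0]]
theta_hat = 0.5
chosen = closest_mean_step(theta_hat, bank)
print(chosen, gpcm_probs(theta_hat, a=1.0, steps=bank[chosen]))
```

A fuller simulation would also need exposure control and a stopping rule, which are outside the scope of this sketch.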