Similar literature
20 similar documents found (search time: 15 ms)
1.
Babitz, Milton; Keys, Noel. Psychometrika, 1940, 5(4): 283-288
It is noted that the average inter-item correlation, which represents the internal consistency of a test, yields a unique estimate of test reliability. A close approximation to this average is given by a formula which requires the correlation of each item with the total score and the standard deviation of each item. The formula is especially useful in those instances where the number of items is small and where the variation in item sigmas should not be neglected.
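The link between the average inter-item correlation and test reliability can be sketched with the standard Spearman-Brown (standardized) form. This is not the authors' approximation, which works from item-total correlations and item standard deviations and is not reproduced in the abstract; the function name and the example values are hypothetical.

```python
def reliability_from_mean_r(mean_r: float, n_items: int) -> float:
    """Spearman-Brown stepped-up reliability from the average inter-item correlation.

    Assumes standardized items (equal item variances); illustrative only.
    """
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return n_items * mean_r / (1.0 + (n_items - 1) * mean_r)


# Hypothetical example: 10 items with an average inter-item correlation of 0.25.
print(round(reliability_from_mean_r(0.25, 10), 3))
```

When item standard deviations differ noticeably, this standardized form ignores that variation, which is the situation the abstract singles out.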

2.
On the mean and variance of the tetrachoric correlation coefficient   (Total citations: 1; self-citations: 0; citations by others: 1)
Estimates of the mean and standard deviation of the tetrachoric correlation are compared with their expected values in several 2 × 2 tables. Significant bias in the mean is found when the minimum cell frequency is less than 5. Three formulas for the standard deviation are compared and guidelines given for their use. This research was performed when the first author was on leave at the University of California at Los Angeles and was supported in part by NIH Special Research Resources Grant RR-3. The second author was also supported by NIH Fellowship 5 F22 GM00328-02.

3.
Guttman's principal components for the weighting system are the item scoring weights that maximize the generalized Kuder-Richardson reliability coefficient. The principal component for any item is effectively the same as the factor loading of the item divided by the item standard deviation, the factor loadings being obtained from an ordinary factor analysis of the item intercorrelation matrix.

4.
The biserial correlation between an item and the total test of which the item is a part tends to be misleadingly high when used in item analysis, since the item is included in the total test. Two formulas with correction for this overlap are derived and compared with Zubin's and Guilford's formulas. One of the new coefficients is invariant to test length.
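The overlap effect described here can be illustrated by correlating an item with the total score and then with the rest score (total minus the item). This sketch uses the Pearson (point-biserial) correlation and the common item-rest correction rather than the biserial formulas derived in the paper, which the abstract does not give; the function names and the data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_total_correlations(scores, item):
    """Return (uncorrected, corrected) item-total correlations for one item.

    scores: list of examinee rows, each a list of 0/1 item scores.
    The corrected value removes the item from the total before correlating.
    """
    item_scores = [row[item] for row in scores]
    totals = [sum(row) for row in scores]
    rest = [t - s for t, s in zip(totals, item_scores)]
    return pearson(item_scores, totals), pearson(item_scores, rest)


# Hypothetical responses of five examinees to three items.
scores = [[1, 1, 1], [1, 0, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0]]
r_total, r_rest = item_total_correlations(scores, item=0)
print(r_total, r_rest)
```

With these data the uncorrected correlation is larger than the item-rest correlation, which is the inflation the abstract refers to.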

5.
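Missing data are common in psychological surveys and experiments. Missing data create a series of problems for estimating the variance components of unbalanced data in generalizability theory. Within the generalizability theory framework, custom programs written in Matlab 7.0 were used to simulate missing data for the random two-facet crossed design p×i×r, and the formula method, the REML method, the splitting method, and the MCMC method were compared on how well they estimate each variance component. The results show that (1) for the random two-facet crossed p×i×r design with missing data, the MCMC method outperforms the other three methods in estimating variance components; (2) items and raters are important factors affecting variance component estimation with missing data.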

6.
Criteria for measures of attitudinal polarization, i.e., degree of opposition among people on a specific issue, are proposed and some formulas, including the standard deviation, are evaluated in relation to the criteria. The formulas were also tested on empirical data with respect to level, dispersion and agreement of received values. The measures on the whole showed a high degree of agreement. There clearly exist instances where the standard deviation is not an adequate measure of attitudinal polarization. Some guidelines are given for the choice of constant values in one of the formulas.

7.
While the Angoff (1971) method is a commonly used cut score method, critics (Berk, 1996; Impara & Plake, 1997) argue that it places excessive cognitive demands on raters. In response to criticisms of the Angoff method, a number of modifications have been proposed. Some suggested Angoff modifications include using an iterative rating process, presenting judges with normative data about item performance, revising the rating judgment into a Yes/No decision, assigning relative weights to dimensions within a test, and using item response theory in setting cut scores. In this study, subject matter expert raters were provided with a 'difficulty anchored' rating scale to use while making Angoff ratings; this scale can be viewed as a variation of the Angoff normative data modification. The rating scale presented test items having known p-values as anchors, and served as a simple means of providing normative information to guide the Angoff rating process. Results are discussed regarding reliability of the mean Angoff rating (.73) and the correlation of mean Angoff ratings with item difficulty (observed r ranges from .65 to .73).

8.
In many educational tests which involve constructed responses, a traditional test score is obtained by adding together item scores obtained through holistic scoring by trained human raters. For example, this practice was used until 2008 in the case of GRE® General Analytical Writing and until 2009 in the case of TOEFL® iBT Writing. With use of natural language processing, it is possible to obtain additional information concerning item responses from computer programs such as e-rater®. In addition, available information relevant to examinee performance may include scores on related tests. We suggest application of standard results from classical test theory to the available data to obtain best linear predictors of true traditional test scores. In performing such analysis, we require estimation of variances and covariances of measurement errors, a task which can be quite difficult in the case of tests with limited numbers of items and with multiple measurements per item. As a consequence, a new estimation method is suggested based on samples of examinees who have taken an assessment more than once. Such samples are typically not random samples of the general population of examinees, so that we apply statistical adjustment methods to obtain the needed estimated variances and covariances of measurement errors. To examine practical implications of the suggested methods of analysis, applications are made to GRE General Analytical Writing and TOEFL iBT Writing. Results obtained indicate that substantial improvements are possible both in terms of reliability of scoring and in terms of assessment reliability.

9.
The paper addresses three neglected questions from IRT. In section 1, the properties of the "measurement" of ability or trait parameters and item difficulty parameters in the Rasch model are discussed. It is shown that the solution to this problem is rather complex and depends both on general assumptions about properties of the item response functions and on assumptions about the available item universe. Section 2 deals with the measurement of individual change or "modifiability" based on a Rasch test. A conditional likelihood approach is presented that yields (a) an ML estimator of modifiability for given item parameters, (b) allows one to test hypotheses about change by means of a Clopper-Pearson confidence interval for the modifiability parameter, or (c) to estimate modifiability jointly with the item parameters. Uniqueness results for all three methods are also presented. In section 3, the Mantel-Haenszel method for detecting DIF is discussed under a novel perspective: What is the most general framework within which the Mantel-Haenszel method correctly detects DIF of a studied item? The answer is that this is a 2PL model where, however, all discrimination parameters are known and the studied item has the same discrimination in both populations. Since these requirements would hardly be satisfied in practical applications, the case of constant discrimination parameters, that is, the Rasch model, is the only realistic framework. A simple Pearson χ² test for DIF of one studied item is proposed as an alternative to the Mantel-Haenszel test; moreover, this test is generalized to the case of two items simultaneously studied for DIF.
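For readers unfamiliar with the Mantel-Haenszel method mentioned in section 3, the following is a minimal sketch of the standard Mantel-Haenszel common odds ratio computed over score-level strata. It illustrates only the classical statistic, not the paper's Pearson χ² alternative; the function name, the tuple layout, and the counts are hypothetical.

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across 2 x 2 tables.

    strata: iterable of (a, b, c, d) counts per score level, where
      a = reference group, item correct    b = reference group, item incorrect
      c = focal group, item correct        d = focal group, item incorrect
    A value near 1 suggests no DIF on the studied item.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined: no discordant counts")
    return numerator / denominator


# Hypothetical counts for two score levels with identical odds in both groups.
print(mantel_haenszel_odds_ratio([(10, 10, 10, 10), (20, 5, 20, 5)]))
```

Matching on total score is what makes the strata comparable; the abstract's point is that this matching is only fully justified when discrimination is constant across items, as in the Rasch model.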

10.
11.
This study investigates using response times (RTs) with item responses in a computerized adaptive test (CAT) setting to enhance item selection and ability estimation and control for differential speededness. Using van der Linden's hierarchical framework, an extended procedure for joint estimation of ability and speed parameters for use in CAT is developed; this is called the joint expected a posteriori estimator (J-EAP). It is shown that the J-EAP estimate of ability and speededness outperforms the standard maximum likelihood estimator (MLE) of ability and speededness in terms of correlation, root mean square error, and bias. It is further shown that under the maximum information per time unit item selection method (MICT), a method which uses estimates for ability and speededness directly, using the J-EAP further reduces average examinee time spent and variability in test times between examinees above the resulting gains of this selection algorithm with the MLE while maintaining estimation efficiency. Simulated test results are further corroborated with test parameters derived from a real data example.

12.
A method of estimating item response theory (IRT) equating coefficients by the common-examinee design with the assumption of the two-parameter logistic model is provided. The method uses marginal maximum likelihood estimation, in which individual ability parameters in a common-examinee group are numerically integrated out. The abilities of the common examinees are assumed to follow a normal distribution but with an unknown mean and standard deviation on one of the two tests to be equated. The distribution parameters are jointly estimated with the equating coefficients. Further, the asymptotic standard errors of the estimates of the equating coefficients and the parameters for the ability distribution are given. Numerical examples are provided to show the accuracy of the method.

13.
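Missing data are common in examination scoring, and how to make effective use of the available data for statistical analysis is a key issue. In examination scoring, the effects of items and raters on scores cannot be ignored. Based on the principles of generalizability theory and following the examination scoring rules, formulas were derived for estimating variance components in a two-facet crossed design (p×i×r) with missing data, and Matlab 7.0 was used to simulate multiple sets of missing data to verify the formulas. The results show that (1) the derived formulas are fairly reliable: bias in the estimated variance components is relatively small, and even when the missing-data rate exceeds 50% the formulas still estimate the variance components fairly accurately; (2) the number of items has the greatest influence on variance component estimation with missing data in generalizability theory, followed by the number of raters, and the estimates tend to stabilize when the numbers of items and raters are 6 and 5 respectively; (3) the number of students has little influence on the estimates of the variance components, so the estimates differ little between small-scale and large-scale examinations.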

14.
The relations among alternative parameterizations of the binary factor analysis (FA) model and two-parameter logistic (2PL) item response theory (IRT) model have been thoroughly discussed in the literature. However, the conversion formulas widely available are mainly for transforming parameter estimates from one parameterization to another. There is a lack of discussion about the standard error (SE) conversion among different parameterizations, when SEs of IRT model parameters are often of immediate interest to practitioners. This article provides general formulas for computing the SEs of transformed parameter values, when these parameters are transformed from FA to IRT models. These formulas are suitable for unidimensional 2PL, multidimensional 2PL, and bi-factor 2PL models. A simulation study is conducted to verify the formulas by providing empirical evidence. A real data example is given at the end as an illustration.
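The idea of carrying a standard error through a parameter transformation can be sketched with the first-order delta method for the simplest case. This is an illustration under stated assumptions, not the article's general formulas: it assumes a unidimensional model with a standardized loading λ, the familiar conversion a = λ / √(1 − λ²), and an optional scaling constant `scale` (hypothetical parameter, for example 1.702 if a logistic metric is wanted).

```python
from math import sqrt


def loading_to_discrimination(loading: float, scale: float = 1.0) -> float:
    """Convert a standardized factor loading to an IRT discrimination."""
    if not -1.0 < loading < 1.0:
        raise ValueError("loading must lie strictly between -1 and 1")
    return scale * loading / sqrt(1.0 - loading ** 2)


def discrimination_se(loading: float, loading_se: float, scale: float = 1.0) -> float:
    """First-order delta-method SE of the converted discrimination.

    d/d(lambda) [lambda / sqrt(1 - lambda^2)] = (1 - lambda^2) ** -1.5,
    so SE(a) is approximately |derivative| * SE(lambda).
    """
    if not -1.0 < loading < 1.0:
        raise ValueError("loading must lie strictly between -1 and 1")
    return scale * loading_se * (1.0 - loading ** 2) ** -1.5


# Hypothetical estimate: loading 0.6 with standard error 0.05.
print(loading_to_discrimination(0.6), discrimination_se(0.6, 0.05))
```

The derivative grows quickly as the loading approaches 1, so the same SE on the loading translates into a much larger SE on the discrimination for highly loading items.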

15.
16.
This article presents normative data for the Rey Auditory and Verbal Learning Test (RAVLT). A representative sample of 390 healthy young adults aged between 18 and 34 living within the Sydney metropolitan area, Australia, completed Form AB of the RAVLT as part of the Macquarie University Neurological Normative Study. Retest data were collected from a subsample of 98 participants after an interval of 1 year. Norms were derived for commonly used measures of the RAVLT and are presented for the whole sample as well as separately for males and females with different levels of education. Age was not found to impact significantly on test performance for this group of young adults, and therefore age-adjusted norms are not provided. An Excel program to calculate RAVLT standard scores (mean of 10 and standard deviation of 3) can be downloaded from http://www.psy.mq.edu.au/RAVLT . Poor test–retest reliability raises concerns about the use of the RAVLT in clinical diagnosis.
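A standard score with mean 10 and standard deviation 3 is conventionally a linear rescaling of a z-score. The sketch below assumes that linear form; whether the authors' Excel program computes scores exactly this way is not stated in the abstract, and the function name and normative values are hypothetical.

```python
def to_standard_score(raw: float, norm_mean: float, norm_sd: float) -> float:
    """Rescale a raw score to a standard score with mean 10 and SD 3.

    norm_mean and norm_sd are the mean and standard deviation of the
    relevant normative group (hypothetical inputs here).
    """
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw - norm_mean) / norm_sd
    return 10.0 + 3.0 * z


# Hypothetical norms: mean 50 and SD 8 for the chosen RAVLT measure.
print(to_standard_score(58, norm_mean=50, norm_sd=8))
```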

17.
Global memory models are evaluated by using data from recognition memory experiments. For recognition, each of the models gives a value of familiarity as the output from matching a test item against memory. The experiments provide ROC (receiver operating characteristic) curves that give information about the standard deviations of familiarity values for old and new test items in the models. The experimental results are consistent with normal distributions of familiarity (a prediction of the models). However, the results also show that the new-item familiarity standard deviation is about 0.8 that of the old-item familiarity standard deviation and independent of the strength of the old items (under the assumption of normality). The models are inconsistent with these results because they predict either nearly equal old and new standard deviations or increasing values of old standard deviation with strength. Thus, the data provide the basis for revision of current models or development of new models.
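How an ROC curve carries information about the two standard deviations can be sketched under the normality assumption the abstract mentions: with normal familiarity distributions, the z-transformed ROC is a straight line whose slope equals the new-item SD divided by the old-item SD. The function names, the new-item mean of 0, and the parameter values below are hypothetical choices for illustration.

```python
from statistics import NormalDist


def zroc_points(mu_old, sd_old, sd_new, criteria):
    """Return (z(false-alarm rate), z(hit rate)) pairs for each criterion.

    Old items ~ Normal(mu_old, sd_old); new items ~ Normal(0, sd_new).
    An item is called "old" when its familiarity exceeds the criterion.
    """
    old = NormalDist(mu_old, sd_old)
    new = NormalDist(0.0, sd_new)
    standard = NormalDist()
    points = []
    for c in criteria:
        hit_rate = 1.0 - old.cdf(c)
        false_alarm_rate = 1.0 - new.cdf(c)
        points.append((standard.inv_cdf(false_alarm_rate), standard.inv_cdf(hit_rate)))
    return points


def zroc_slope(points):
    """Slope of the line through the first and last z-ROC points."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    return (y1 - y0) / (x1 - x0)


# Hypothetical parameters: new-item SD is 0.8 of the old-item SD.
pts = zroc_points(mu_old=1.0, sd_old=1.0, sd_new=0.8, criteria=[-0.5, 0.0, 0.5, 1.0, 1.5])
print(zroc_slope(pts))
```

Under these assumptions the slope recovers the SD ratio regardless of the old-item mean, which is why a roughly constant slope across strength conditions speaks against models predicting that the old-item SD grows with strength.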

18.
Asymptotic formulas are derived for the bias in the maximum likelihood estimators of the item parameters in the logistic item response model when examinee abilities are known. Numerical results are given for a typical verbal test for college admission.

19.
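Development of norms for a diagnostic scale for children with learning difficulties   (Total citations: 8; self-citations: 0; citations by others: 8)
邵志芳, 陈国鹏, 单阳. 《心理科学》, 2000, 23(2): 169-171
After the "Learning Difficulties Checklist" was administered to 1,067 participants in Shanghai, a preliminary analysis of the data yielded: (1) the means and standard deviations of the raw data for each variable; (2) significant age and sex differences in the raw data for each variable; (3) reliability and validity results that basically meet psychometric requirements; (4) norms for the Shanghai area.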

20.
As the Internet became widely used, problems associated with its excessive use became increasingly apparent. Although for the assessment of these problems several models and related questionnaires have been elaborated, there has been little effort made to confirm them. The aim of the present study was to test the three-factor model of the previously created Problematic Internet Use Questionnaire (PIUQ) by data collection methods formerly not applied (off-line group and face-to-face settings), on the one hand, and by testing on different age groups (adolescent and adult representative samples), on the other hand. Data were collected from 438 high-school students (44.5 percent boys; mean age: 16.0 years; standard deviation = 0.7 years) and also from 963 adults (49.9 percent males; mean age: 33.6 years; standard deviation = 11.8 years). We applied confirmatory factor analysis to confirm the measurement model of problematic Internet use. The results of the analyses carried out inevitably support the original three-factor model over the possible one-factor solution. Using latent profile analysis, we identified 11 percent of adults and 18 percent of adolescent users characterized by problematic use. Based on exploratory factor analysis, we also suggest a short form of the PIUQ consisting of nine items. Both the original 18-item version of PIUQ and its short 9-item form have satisfactory reliability and validity characteristics, and thus, they are suitable for the assessment of problematic Internet use in future studies.
