This paper describes a relationship between the variance-covariance matrix of test items and Woodbury's concept of the standard length of a test. An index of item-test relationship is described in standard length terms. The sum of these indices for the items in a test is equal to the square of Jackson's coefficient of sensitivity.

HORST P 《Psychometrika》1948,13(3):125-134
A battery of pencil-and-paper tests is commonly used for predicting a single criterion. If the score on each test is the number of correct answers, the composite battery score would normally be the sum of the weighted test scores, where the weights are the raw score regression weights. Knowing the reliability of each test, it is possible to alter the lengths of the tests in a manner such that the weights will all be equal. The composite battery score would then simply be the total number of items answered correctly and scoring would be greatly simplified. Such simplification is particularly desirable where the volume of testing is large. Section I of the article outlines the procedure for altering the lengths of the tests, and Section II gives a proof of the method.

Reaction times were examined for magnitude estimates of line length. In the first two experiments, reaction times increased linearly with judged length. This result is consistent with the hypothesis that the judgments are made by laying off a mental image of the standard along the line to be judged. The slope of the function relating judged length to reaction time was not affected by the length of the standard line, suggesting that the rate at which the image of the standard is laid off is not a function of the length of the standard. Reaction time also increased linearly with judged length when a standard of 1 in. was suggested but not provided, as well as when no standard was suggested. The hypothesized laying-off process was compared to other cognitive manipulations, such as mental rotation and size scaling. Equivalence of judgments based on the representation of the standard in perceptual memory and in imagination is discussed.

In some situations where reliability must be estimated it is impossible to divide the measuring instrument into more than two separately scoreable parts. When this is the case, the parts may be homogeneous in content but clearly unequal in length. The resultant scores will not be essentially τ-equivalent, and hence total test reliability cannot be satisfactorily estimated via Cronbach's coefficient alpha. Limitation on the number of parts rules out Kristof's three-part approach. A technique is developed for estimating reliability in such situations. The approach is shown to function very well when applied to five achievement tests.

Measures of effective test length are developed for speeded and power tests, which are independent of the number of items in the test or of the time required for administration. These measures are used in determining reliability for (1) speeded and power tests, where a separately timed short parallel form is administered in addition to the full-length test; (2) power tests, where a subset of items is imbedded within the total test, parallel to the total test; and (3) power tests, where the subset of items is correlated with the complementary parallel subset in the test.

The application of the Thematic Apperception Test to the assessment of motives has been heralded as an important milestone in personality psychology. However, although this approach is well established, there is at present no standard battery of cues for measuring the Big Three motives (achievement, affiliation, power). Furthermore, the extent to which scoring subcategories contribute to overall motive scores has been neglected. Our research with students and managers examined the effectiveness of picture cues in eliciting motive imagery and the prevalence of scoring subcategories within each motive scoring system. Results from 2 data sets comprising 547 men and women suggested that there were 3 cues that should be retained for future research and that motive scoring systems could be refined through removal of redundant subcategories. Further research is needed to systematically investigate the effectiveness of a standard battery of cues and the validity of revised motive scoring systems.

GREEN BF 《Psychometrika》1950,15(3):251-257
A procedure is proposed for testing the significance of group differences in the standard error of measurement of a psychological test. Wilks' criterion is used to assure that the tests used in ascertaining reliability and hence variance of errors of measurement may be assumed parallel for each group. Votaw's criterion may be used to check whether the test scores of all the groups have the same mean, variance, and covariance. It is possible, however, for the variance and reliability of the test to differ widely from group to group, so that Votaw's criterion is not satisfied even though the variance of errors of measurement stays relatively constant. For this case a modification of Neyman and Pearson's criterion is developed to test agreement among standard errors of measurement despite group differences in mean, variance, and reliability of the test. The author wishes to acknowledge the helpful criticisms of Dr. Harold Gulliksen, who suggested the problem.
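As a brief illustration of why the standard error of measurement can stay constant while variance and reliability differ, the classical test theory definition can be written out. The symbols and the two numerical groups below are illustrative assumptions, not values from the article.

```latex
% Classical test theory: standard error of measurement for a test X
% with observed-score standard deviation \sigma_X and reliability \rho_{XX'}.
\sigma_e = \sigma_X \sqrt{1 - \rho_{XX'}}

% Hypothetical group A: \sigma_X = 10, \rho_{XX'} = 0.75
\sigma_e = 10\sqrt{1 - 0.75} = 10 \times 0.5 = 5

% Hypothetical group B: \sigma_X = 20, \rho_{XX'} = 0.9375
\sigma_e = 20\sqrt{1 - 0.9375} = 20 \times 0.25 = 5
```

The two hypothetical groups differ in both variance and reliability, yet share the same standard error of measurement, which is the situation the modified criterion is meant to test.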

HORST P 《Psychometrika》1949,14(2):79-88
If the lengths of the tests in a battery are altered, their intercorrelations and their validities or correlations with a criterion are also altered. Consequently, the multiple correlation of the battery with the criterion will also be altered. These changes are a function of the reliabilities of the tests. Suppose we are given, from a set of experimental data, (1) the time allowed for each test in the battery, (2) the reliability of each test, (3) the intercorrelations, and (4) the validities of all the tests. If we specify the over-all testing time we are willing to allow for the test in the future, we can determine the amount by which each test must be altered in order to give the maximum multiple correlation with the criterion. The method is presented, together with numerical examples and the mathematical proof.
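The dependence of reliability, validity, and intercorrelation on test length can be sketched with the standard Spearman-Brown relations of classical test theory. This is a minimal sketch, not the optimization method of the article; the function names and the lengthening factor `k` (new length divided by old length) are assumptions introduced for illustration.

```python
import math


def lengthened_reliability(rel, k):
    """Spearman-Brown: reliability of a test whose length is multiplied by k."""
    return k * rel / (1.0 + (k - 1.0) * rel)


def lengthened_validity(validity, rel, k):
    """Correlation with an unchanged criterion after the test length is multiplied by k."""
    return validity * math.sqrt(k / (1.0 + (k - 1.0) * rel))


def lengthened_intercorrelation(r12, rel1, k1, rel2, k2):
    """Correlation between two tests after their lengths are multiplied by k1 and k2."""
    f1 = math.sqrt(k1 / (1.0 + (k1 - 1.0) * rel1))
    f2 = math.sqrt(k2 / (1.0 + (k2 - 1.0) * rel2))
    return r12 * f1 * f2


# Hypothetical test: reliability 0.50, validity 0.30, doubled in length.
print(lengthened_reliability(0.50, 2.0))
print(lengthened_validity(0.30, 0.50, 2.0))
```

All three quantities depend on the original reliabilities, which is why the reliabilities must be known before the lengths can be chosen to maximize the multiple correlation.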

When the reliability of test scores must be estimated by an internal consistency method, partition of the test into just 2 parts may be the only way to maintain content equivalence of the parts. If the parts are classically parallel, the Spearman-Brown formula may be validly used to estimate the reliability of total scores. If the parts differ in their standard deviations but are tau equivalent, Cronbach's alpha is appropriate. However, if the 2 parts are congeneric, that is, they are unequal in functional length or they comprise heterogeneous item types, a less well-known estimate, the Angoff-Feldt coefficient, is appropriate. Guidelines in terms of the ratio of standard deviations are proposed for choosing among Spearman-Brown, alpha, and Angoff-Feldt coefficients.
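The three two-part coefficients named above can be compared on the same data. The following is a minimal sketch using the usual textbook forms of the coefficients; the function name, the returned dictionary keys, and the sample scores are assumptions made for illustration, and the article's guidelines for choosing among the coefficients are not reproduced here.

```python
import math
import statistics


def two_part_reliabilities(part1, part2):
    """Return Spearman-Brown, alpha, and Angoff-Feldt estimates for a two-part test."""
    total = [a + b for a, b in zip(part1, part2)]
    v1 = statistics.variance(part1)   # sample variance of part 1
    v2 = statistics.variance(part2)   # sample variance of part 2
    vx = statistics.variance(total)   # sample variance of total scores
    cov = (vx - v1 - v2) / 2.0        # covariance between the two parts
    r = cov / math.sqrt(v1 * v2)      # correlation between the two parts
    return {
        "spearman_brown": 2.0 * r / (1.0 + r),
        "alpha": 4.0 * cov / vx,
        "angoff_feldt": 4.0 * cov / (vx - (v1 - v2) ** 2 / vx),
    }


# Hypothetical error-free scores where part 2 is twice as "long" as part 1.
part1 = [1.0, 2.0, 3.0, 4.0, 5.0]
part2 = [2.0, 4.0, 6.0, 8.0, 10.0]
print(two_part_reliabilities(part1, part2))
```

In this artificial error-free example the parts are perfectly congeneric but unequal in length, so alpha falls below 1 while the Angoff-Feldt coefficient does not. When the two part variances are equal, the Angoff-Feldt and alpha expressions coincide.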

The concepts of multiple differential prediction and multiple absolute prediction are developed in earlier papers (2, 3). The problem of determining the optimal distribution of testing time for multiple differential prediction has been previously considered (4). This paper develops an analogous procedure for multiple absolute prediction. A numerical example illustrating the procedure is presented. The mathematical rationale underlying the procedure is given. This research was carried out under Contract Nonr-477(08) between the University of Washington and the Office of Naval Research. The computations were carried out by Robert Dear and Donald Mills. Much credit is due the typist, Elizabeth Cross. Supervision of both computational and editorial activities was provided by William Clemans. To each of these able contributors we are deeply grateful.

For the case of a single criterion a method is already available for determining the optimal distribution of testing time for a battery of predictors, assuming that intercorrelation, validity, and reliability data are available for predictors of arbitrary lengths. In this article a modification and generalization of the method is presented for the case of differential prediction involving a number of criterion variables. A numerical example is given to illustrate the method, after which the mathematical rationale is outlined. This research was carried out under Contract Nonr-477(08) between the University of Washington and the Office of Naval Research. Most of the computations were carried out by Robert Dear, Charlotte MacEwan, and Donald Mills. Much credit is due the typist, Elizabeth Cross. Supervision of both computational and editorial activities was provided by William Clemans. To each of these able contributors I am deeply grateful.

HORST P 《Psychometrika》1951,16(2):189-202
Having given a fixed amount of total testing time it is important to know how long each test in the battery should be so that the correlation of the battery with the criterion will be a maximum. The precise solution for the test lengths will depend on a particular set of conditions which may be specified. The writer has previously presented solutions for two sets of conditions. This article presents the solution for a third set of conditions. These are: (1) The total number of items or testing time is fixed. (2) The score is the total number of items correctly answered. (3) The test lengths are determined in such a way that the correlation of total score with the criterion is a maximum. The solutions for the two previous sets of conditions, together with the current set, are summarized. A set of experimental data is submitted to each solution and the three sets of results are compared.

A very early student project undertaken by Friedrich Hegelmaier (1833–1906), published in German in 1852, is republished in English translation. Slight though the experimental work is, it nevertheless occupies a unique place in the history of experimental psychology. It is the source whence Fechner had the method of constant stimuli, a method that continued in use as the preferred psychophysical method, substantially in the form described here, for more than a century. The experiment is arguably the first experiment in the modern sense of a systematic preplanned body of observations and has the glaring faults that one would expect in a very first experiment. Finally, Hegelmaier suggests the use of two simultaneous tasks as a means to investigate human performance, a full hundred years before that idea was realized in practice. If only he had continued in experimental psychology!

This paper presents a contribution to the sampling theory of a set of homogeneous tests which differ only in length, test length being regarded as an essential test parameter. Observed variance-covariance matrices of such measurements are taken to follow a Wishart distribution. The familiar true score-and-error concept of classical test theory is employed. Upon formulation of the basic model it is shown that in a combination of such tests forming a “total” test, the signal-to-noise ratio of the components is additive and that the inverse of the population variance-covariance matrix of the component measures has all of its off-diagonal elements equal, regardless of distributional assumptions. This fact facilitates the subsequent derivation of a statistical sampling theory, there being at most m + 1 free parameters when m is the number of component tests. In developing the theory, the cases of known and unknown test lengths are treated separately. For both cases maximum-likelihood estimators of the relevant parameters are derived. It is argued that the resulting formulas will remain reasonable even if the distributional assumptions are too narrow. Under these assumptions, however, maximum-likelihood ratio tests of the validity of the model and of hypotheses concerning reliability and standard error of measurement of the total test are given. It is shown in each case that the maximum-likelihood equations possess precisely one acceptable solution under rather natural conditions. Application of the methods can be effected without the use of a computer. Two numerical examples are appended by way of illustration. This research was supported in part by The National Institute of Child Health and Human Development, under Research Grant 1 PO1 HDO1762.

Many authors adhere to the rule that test reliabilities should be at least .70 or .80 in group research. This article introduces a new standard according to which reliabilities can be evaluated. This standard is based on the costs or time of the experiment and of administering the test. For example, if test administration costs are 7% of the total experimental costs, the efficient value of the reliability is .93. If the actual reliability of a test is equal to this efficient reliability, the test size maximizes the statistical power of the experiment, given the costs. As a standard in experimental research, it is proposed that the reliability of the dependent variable be close to the efficient reliability. Adhering to this standard will enhance the statistical power and reduce the costs of experiments.

Raven's Standard Progressive Matrices (RSPM) is a 60-item test for measuring abstract reasoning, considered a nonverbal estimate of fluid intelligence, and often included in clinical assessment batteries and research on patients with cognitive deficits. The goal was to develop and apply a predictive model approach to reduce the number of items necessary to yield a score equivalent to that derived from the full scale. The approach is based on a Poisson predictive model. A parsimonious subset of items that accurately predicts the total score was sought, as was a second nonoverlapping alternate form for repeated administrations. A split sample was used for model fitting and validation, with cross-validation to verify results. Using nine RSPM items as predictors, correlations of .9836 and .9782 were achieved for the reduced forms and .9063 and .8978 for the validation data. Thus, a 9-item subset of RSPM predicts the total score for the 60-item scale with good accuracy. A comparison of psychometric properties between 9-item forms, a published 30-item form, and the 60-item set is presented. The two 9-item forms provide a 75% administration time savings compared with the 30-item form, while achieving similar item- and test-level characteristics and equal correlations to 60-item based scores.

A paradoxical implication of Kraemer's expression for the large-sample standard error of Brogden's form of the biserial correlation is identified, and a new expression is given which does not imply the paradox. However, numerical evidence is presented which calls into question the correctness of the expression.

