Similar articles
Found 20 similar articles (search time: 31 ms)
1.
Methods to determine the direction of a regression line, that is, to determine the direction of dependence in reversible linear regression models (e.g., x → y vs. y → x), have experienced rapid development within the last decade. However, previous research largely rested on the assumption that the true predictor is measured without measurement error. The present paper extends the direction dependence principle to measurement error models. First, we discuss asymmetric representations of the reliability coefficient in terms of higher moments of variables and the attenuation of skewness and excess kurtosis due to measurement error. Second, we identify conditions where direction dependence decisions are biased due to measurement error and suggest method of moments (MOM) estimation as a remedy. Third, we address data situations in which the true outcome exhibits both regression and measurement error, and propose a sensitivity analysis approach to determining the robustness of direction dependence decisions against unreliably measured outcomes. Monte Carlo simulations were performed to assess the performance of MOM-based direction dependence measures and their robustness to violated measurement error assumptions (i.e., non-independence and non-normality). An empirical example from subjective well-being research is presented. The plausibility of model assumptions and links to modern causal inference methods for observational data are discussed.

2.
When several test forms are used interchangeably, they will not in practice, despite all efforts, be rigorously parallel. A definition of the standard error of measurement appropriate for this type of situation can be provided; however, it will be different from the definition in classical test theory. Appropriate formulas for the standard error of measurement and for other related quantities under both definitions are derived and compared; also sample statistics for estimating these quantities. The writer is indebted to Lee Cronbach and Julian Stanley for many helpful suggestions for revising a draft of this paper. Part of the work was carried out while the writer was Brittingham Visiting Professor at the University of Wisconsin. This work was supported in part by contract Nonr-2752(00) between the Office of Naval Research and Educational Testing Service. Reproduction in whole or in part for any purpose of the United States Government is permitted. The term "nominally parallel" was suggested by Lee Cronbach.

3.
The author compared simulations of the “true” null hypothesis (z) test, in which σ was known and fixed, with the t test, in which s, an estimate of σ, was calculated from the sample because the t test was used to emulate the “true” test. The true null hypothesis test bears exclusively on calculating the probability that a sample distance (mean) is larger than a specified value. The results showed that the value of t was sensitive to sampling fluctuations in both distance and standard error. Large values of t reflect small standard errors when n is small. The value of t achieves sensitivity primarily to distance only when the sample sizes are large. One cannot make a definitive statement about the probability or “significance” of a distance solely on the basis of the value of t.

4.
Cognitive abilities are thought to represent temporally stable constructs; however, accumulating evidence suggests that effects of the measurement situation could affect their measurement (e.g., effects of test motivation, stress level). The present study modeled these effects explicitly in a latent variables approach. In contrast to previous studies, we investigated participants (job candidates) in repeated high‐stakes settings (N = 188). We found that cognitive ability measurements in high‐stakes settings not only reflect a stable latent trait and random measurement error, but also systematic effects of the test setting. Our results support the application of cognitive ability tests in organizational contexts but have implications for their use in applied settings such as personnel selection.

5.
We use classical test theory (CTT) and item response theory (IRT) methodologies to examine the psychometric and measurement properties of an instrument designed to assess sexual orientation harassment among military personnel (N = 71,989). CTT analyses indicated that items were unidimensional and exhibited adequate levels of reliability. IRT analyses demonstrated that the items functioned similarly and exhibited appropriate levels of item discrimination. However, the analyses also suggested that the sensitivity of the items may be limited. Differential test functioning analyses provided evidence of the measurement equivalence of the instrument across male and female respondents. The findings provide support for the psychometric properties and measurement equivalence of the instrument for measuring sexual orientation harassment among male and female military personnel. We discuss the implications of our findings for future research on sexual orientation harassment in the workplace.

6.
As usually interpreted, the standard error of measurement is assumed to be constant throughout the test-score range. In this investigation the standard error of measurement was assumed to be not higher than a second-degree function of the test score. By conceiving a test score to be made up of the scores on two parallel tests, an equation was derived for predicting the standard error of measurement from the test score. In the derivation the corresponding first four moments of the score distributions for the parallel tests were assumed to be identical, and certain errors of estimate involved in predicting the second test score from the first were assumed to be uncorrelated with powers of the score on the first test. An empirical verification was carried out, using nine synthetic tests and a 1000-case sample, and showed good agreement between predicted and observed results. The findings indicated that the standard error of measurement was constant only for a symmetrical, mesokurtic distribution of scores. This study was carried out while the author was a National Research Council Predoctoral Fellow in psychology at Princeton University. The author wishes to express his appreciation for the guidance given by his thesis adviser, Professor Harold Gulliksen. He wishes also to acknowledge his gratitude to the Educational Testing Service for extensive assistance in the empirical phase of the study, and to Dr. Ledyard Tucker for suggesting efficient methods of handling special computational problems.
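
For orientation, the constant standard error of measurement that the abstract above takes as its starting point is usually written in classical test theory as shown below. The symbols are notation assumed here, not taken from the abstract: s_X is the standard deviation of observed scores and r_{XX'} is the reliability (the correlation between parallel forms).

```latex
\[
\mathrm{SEM} = s_X \sqrt{1 - r_{XX'}}
\]
```

Because neither term on the right depends on an individual's score, this expression gives the same value across the whole score range, which is the assumption the study relaxes.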

7.
Random effects meta‐regression is a technique to synthesize results of multiple studies. It allows for a test of an overall effect, as well as for tests of effects of study characteristics, that is, (discrete or continuous) moderator effects. We describe various procedures to test moderator effects: the z, t, likelihood ratio (LR), Bartlett‐corrected LR (BcLR), and resampling tests. We compare the Type I error of these tests, and conclude that the common z test, and to a lesser extent the LR test, do not perform well since they may yield Type I error rates appreciably larger than the chosen alpha. The error rate of the resampling test is accurate, closely followed by the BcLR test. The error rate of the t test is less accurate but arguably tolerable. With respect to statistical power, the BcLR and t tests slightly outperform the resampling test. Therefore, our recommendation is to use either the resampling or the BcLR test. If these statistics are unavailable, then the t test should be used since it is certainly superior to the z test.

8.
In a pre‐test–post‐test cluster randomized trial, one of the methods commonly used to detect an intervention effect involves controlling pre‐test scores and other related covariates while estimating an intervention effect at post‐test. In many applications in education, the total post‐test and pre‐test scores, ignoring measurement error, are used as response variable and covariate, respectively, to estimate the intervention effect. However, these test scores are frequently subject to measurement error, and statistical inferences based on the model ignoring measurement error can yield a biased estimate of the intervention effect. When multiple domains exist in test data, it is sometimes more informative to detect the intervention effect for each domain than for the entire test. This paper presents applications of the multilevel multidimensional item response model with measurement error adjustments in a response variable and a covariate to estimate the intervention effect for each domain.

9.
In a variety of measurement situations, the researcher may wish to compare the reliabilities of several instruments administered to the same sample of subjects. This paper presents eleven statistical procedures which test the equality of m coefficient alphas when the sample alpha coefficients are dependent. Several of the procedures are derived in detail, and numerical examples are given for two. Since all of the procedures depend on approximate asymptotic results, Monte Carlo methods are used to assess the accuracy of the procedures for sample sizes of 50, 100, and 200. Both control of Type I error and power are evaluated by computer simulation. Two of the procedures are unable to control Type I errors satisfactorily. The remaining nine procedures perform properly, but three are somewhat superior in power and Type I error control. A more detailed version of this paper is also available.

10.
The main question of the present investigation was whether the Life Regard Index (LRI) is an adequate instrument to study possible differences between young and elderly adults with regard to experienced meaning in life. Participants in this study were a group of 206 young adults (M = 17.8 years, 49.0% female) and a group of 373 elderly adults (M = 65.90 years, 58.8% female). Respondents completed a Dutch paper-and-pencil version of the LRI. The LISREL confirmatory factor analytic model was used to test for the equivalence of measurement and of structure of the instrument. Results show that in both age groups the items of the LRI were found to be distributed to the two a priori dimensions of meaning in life, Framework and Fulfillment. Only the factor loadings of the Framework items were invariant across both groups. Neither the error of measurement nor the structure of the underlying concept was equivalent for young and elderly adults. Young adults were found to experience less meaning in life than the elderly.

11.
The use of sentence imitation tasks to measure the development of linguistic competence in children is discussed, and the drawbacks of such techniques noted, particularly in the case of adult subjects. A measurement based on results from an error recognition test is proposed. A set of variables from Welsh is described, together with an error recognition test designed around them. The results of the test seem to indicate that linguistic features that may superficially seem equally variable may have different statuses in the competence of speakers. This sort of test is felt to have a part to play in the measurement of linguistic competence.

12.
This paper is a presentation of an essential part of the sampling theory of the error variance and the standard error of measurement. An experimental assumption is that several equivalent tests with equal variances are available. These may be either final forms of the same test or obtained by dividing one test into several parts. The simple model of independent and normally distributed errors of measurement with zero mean is employed. No assumption is made about the form of the distributions of true and observed scores. This implies unrestricted freedom in defining the population. First, maximum-likelihood estimators of the error variance and the standard error of measurement are obtained, their sampling distributions given, and their properties investigated. Then unbiased estimators are defined and their distributions derived. The accuracy of estimation is given special consideration from various points of view. Next, rigorous statistical tests are developed to test hypotheses about error variances on the basis of one and two samples. Also the construction of confidence intervals is treated. Finally, Bartlett's test of homogeneity of variances is used to provide a multi-sample test of equality of error variances.

13.
The Thorndike model of test fairness has recently been revised and used to argue that cognitive ability tests are biased against certain groups of test‐takers because ability tests show larger mean differences across racial groups than do job performance measures. We discuss two critical factors that confound this new version of Thorndike's model, making it susceptible to false indications of test bias. Those factors are (a) measurement error (i.e., reliability) in both the predictor and criterion and (b) the Spearman–Jensen effect (i.e., the well‐documented effect that group differences in observed g‐saturated measures are directly proportional to the degree the manifest indicator reflects g). Finally, because the Spearman–Jensen effect is not well known within the applied literature, we present a brief simulation to better elucidate the implications of the Spearman–Jensen effect for personnel selection in general, and claims of bias in cognitive ability testing in particular.

14.
I compared the randomization/permutation test and the F test for a two-cell comparative experiment. I varied (1) the number of observations per cell, (2) the size of the treatment effect, (3) the shape of the underlying distribution of error and, (4) for cases with skewed error, whether or not the skew was correlated with the treatment. With normal error, there was little difference between the tests. When error was skewed, by contrast, the randomization test was more sensitive than the F test, and if the amount of skew was correlated with the treatment, the advantage for the randomization test was both large and positively correlated with the treatment. I conclude that, because the randomization test was never less powerful than the F test, it should replace the F test in routine work.
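
The randomization test compared in the abstract above can be illustrated with a minimal sketch for a two-cell design. This is not the author's simulation code: the function name `permutation_test`, the number of resamples, the seed, and the use of the absolute mean difference as the test statistic are all assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_test(cell_a, cell_b, n_resamples=5000, seed=0):
    """Approximate two-sided permutation p-value for a difference in cell means.

    Hypothetical helper: repeatedly reshuffles the pooled observations into
    two cells of the original sizes and counts how often the reshuffled
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(cell_a) - mean(cell_b))
    pooled = list(cell_a) + list(cell_b)
    n_a = len(cell_a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    control = [1, 2, 3, 4, 5]
    treated = [101, 102, 103, 104, 105]
    print(permutation_test(control, treated))
```

Unlike the F test, this procedure makes no assumption about the shape of the error distribution, which is why the two tests can diverge when error is skewed.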

15.
For 25 years psychologists have measured systematic measurement bias in terms of regression lines. According to this traditional approach a test is an unbiased predictor of a criterion for all subgroups if all subgroups have identical Y regression lines (i.e., identical slopes and identical Y intercepts). This paper shows that the traditional model is fundamentally incorrect and identical Y regression lines are not expected to occur with an unbiased test in a testing situation in which one group scores lower than another group on both the test and criterion. This is the case even if the test is perfectly reliable. The traditional model for measuring bias actually results in a consistent error or bias against groups which score lower than average on both the test and criterion. In practice this bias operates against minority groups. Tests now thought to be unbiased or even biased in favor of minority groups may in fact be biased against minority groups. A new model of test bias, which is based solely on measurement principles, is briefly introduced. In this model unbiased tests produce groups with identical test-criterion common-factor axes having a slope of S_YC/S_XC and with each axis intersecting the group centroids.

16.
Formulas for the standard error of measurement of three measures of change (simple difference scores, residualized difference scores, and the measure introduced by Tucker, Damarin, and Messick) are derived. Equating these formulas by pairs yields additional explicit formulas which provide a practical guide for determining the relative error of the three measures in any pretest-posttest design. The functional relationship between the standard error of measurement and the correlation between pretest and posttest observed scores remains essentially the same for each of the three measures despite variations in other test parameters (reliability coefficients, standard deviations), even when pretest and posttest errors of measurement are correlated.

17.
Green, B. F. Psychometrika, 1950, 15(3): 251-257
A procedure is proposed for testing the significance of group differences in the standard error of measurement of a psychological test. Wilks' criterion is used to assure that the tests used in ascertaining reliability and hence variance of errors of measurement may be assumed parallel for each group. Votaw's criterion may be used to check whether the test scores of all the groups have the same mean, variance, and covariance. It is possible, however, for the variance and reliability of the test to differ widely from group to group, so that Votaw's criterion is not satisfied even though the variance of errors of measurement stays relatively constant. For this case a modification of Neyman and Pearson's criterion is developed to test agreement among standard errors of measurement despite group differences in mean, variance, and reliability of the test. The author wishes to acknowledge the helpful criticisms of Dr. Harold Gulliksen, who suggested the problem.

18.
The study used multiple-group confirmatory factor analysis (CFA) and multiple indicators multiple causes (MIMIC) procedures to examine the measurement and construct equivalencies for father and mother ratings of ADHD symptoms, recoded as binary scores. Fathers (N = 387) and mothers (N = 411) rated their primary school-aged children on the Disruptive Behavior Rating Scale (Barkley & Murphy, 1998). For the multiple-group CFA analyses, the results involving differences in practical fit indices supported full measurement and construct equivalencies, whereas the chi-square difference test showed lack of equivalency in five symptoms for factor loadings, four symptoms for error variance, and the variance and mean scores for the hyperactivity-impulsivity factor. For the MIMIC analyses, six symptoms lacked equivalency for thresholds. These findings extend existing data in this area. The theoretical, psychometric and clinical implications of the findings are discussed.

19.
According to Wollack and Schoenig (2018, The Sage encyclopedia of educational research, measurement, and evaluation. Thousand Oaks, CA: Sage, 260), benefiting from item preknowledge is one of the three broad types of test fraud that occur in educational assessments. We use tools from constrained statistical inference to suggest a new statistic that is based on item scores and response times and can be used to detect examinees who may have benefited from item preknowledge for the case when the set of compromised items is known. The asymptotic distribution of the new statistic under no preknowledge is proved to be a simple mixture of two χ² distributions. We perform a detailed simulation study to show that the Type I error rate of the new statistic is very close to the nominal level and that the power of the new statistic is satisfactory in comparison to that of the existing statistics for detecting item preknowledge based on both item scores and response times. We also include a real data example to demonstrate the usefulness of the suggested statistic.

20.
Background. Researchers often test people before and after some treatment and compare these scores with a control group. Sometimes it is not possible to allocate people into conditions randomly, which means the initial scores for the two groups may differ. There are two main approaches: t test on the gain scores and ANCOVA partialling out the initial scores. Lord (1967) showed that these can lead to different conclusions. This is an often‐discussed paradox in psychology and education. Aims. The reasons why these approaches can lead to different conclusions, the assumptions that each approach makes and how the approaches relate to group allocation, are discussed. Methods. Three sets of simulations are reported that investigate the relationships among effect size, group allocation, measurement error and Lord's paradox. Conclusions. Recommendations are given that stress careful examination of the research questions, sampling and allocation of participants and graphing the data. ANCOVA is appropriate when allocation is based on the initial scores, t test can be appropriate if allocation is associated non‐causally with the initial scores, but often neither approach provides adequate results.
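
The disagreement between the two approaches in the abstract above can be made concrete with a small sketch. It is not taken from the paper: the function names and the toy data are assumptions, and only point estimates are computed (no t or F statistics). The gain-score estimate is the difference in mean gains, while the ANCOVA estimate is the difference in post-test means adjusted by the pooled within-group slope of post-test on pre-test.

```python
from statistics import mean


def within_group_slope(groups):
    """Pooled within-group slope of post-test on pre-test (hypothetical helper)."""
    sxy = 0.0
    sxx = 0.0
    for pre, post in groups:
        mean_pre, mean_post = mean(pre), mean(post)
        sxy += sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
        sxx += sum((x - mean_pre) ** 2 for x in pre)
    return sxy / sxx


def gain_score_effect(control, treated):
    """Difference between groups in mean gain (post minus pre)."""
    (pre_c, post_c), (pre_t, post_t) = control, treated
    return (mean(post_t) - mean(pre_t)) - (mean(post_c) - mean(pre_c))


def ancova_effect(control, treated):
    """Post-test difference adjusted for pre-test using the pooled slope."""
    (pre_c, post_c), (pre_t, post_t) = control, treated
    slope = within_group_slope([control, treated])
    return (mean(post_t) - mean(post_c)) - slope * (mean(pre_t) - mean(pre_c))


# Toy data (assumed): groups differ at pre-test and neither group changes on average.
control = ([1, 2, 3, 4, 5], [2, 2, 3, 4, 4])
treated = ([4, 5, 6, 7, 8], [5, 5, 6, 7, 7])

if __name__ == "__main__":
    print(gain_score_effect(control, treated))
    print(ancova_effect(control, treated))
```

With these toy data the gain-score estimate is 0 while the ANCOVA estimate is about 1.2. The two estimates differ by (1 - slope) times the pre-test difference between groups, so they coincide only when the groups start equal or the within-group slope is 1.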
