Similar Documents
20 similar documents retrieved.
1.
Allard G, Faust D. Assessment, 2000, 7(2): 119-129.
Given the paucity of previous research, we examined the occurrence of scoring error on widely used objective personality tests and its possible relation to two factors: scoring procedure complexity (SPC) and commitment to accuracy (CTA). We double-checked the scoring of three tests (MMPI, Beck Depression Inventory, Spielberger State/Trait Anxiety Inventory) across three settings. Each of the tests was misscored at a surprisingly high rate in at least one setting, and some such errors altered major interpretive implications. Tests of higher SPC showed greater error rates, but high CTA greatly reduced the occurrence of error across levels of SPC. Unexpected sources of error were also uncovered, such as commercial computer scoring errors and disagreement in scoring standards among test publishers. Practical suggestions for improving scoring accuracy are offered.

2.
Experimentwise error rates of the type proposed by Ryan (1959) are discussed and contrasted with a new measure of the likelihood that the results of a series of significance tests are Type I errors. This new measure, the Alpha Percentage (a%), shares the advantages of experimentwise error rates over individual alpha levels in reducing Type I errors in communication research, but the Alpha Percentage has much greater power than currently used experimentwise error rates to detect significant effects. Four arguments against the use of experimentwise error procedures are discussed, and EW, EP, and a% rates are reported for Communication Monographs and Human Communication Research.

3.
It is well known that parametric tests are statistically optimal when errors are normally distributed; it is perhaps less well known that, when normality does not hold, nonparametric tests frequently possess greater statistical power than parametric tests while controlling the Type I error rate. However, the use of nonparametric procedures has been limited by the absence of easily performed tests for complex experimental designs and analyses and by limited information about their statistical behavior under realistic conditions. A Monte Carlo study of tests of predictor subsets in multiple regression analysis indicates that various nonparametric tests show greater power than the F test for skewed and heavy-tailed data. These nonparametric tests can be computed with available software.

4.
A statistical model for combining p values from multiple tests of significance is used to define rejection and acceptance regions for two-stage and three-stage sampling plans. Type I error rates, power, frequencies of early termination decisions, and expected sample sizes are compared. Both the two-stage and three-stage procedures provide appropriate protection against Type I errors. The two-stage sampling plan with its single interim analysis entails minimal loss in power and provides substantial reduction in expected sample size as compared with a conventional single end-of-study test of significance for which power is in the adequate range. The three-stage sampling plan with its two interim analyses introduces somewhat greater reduction in power, but it compensates with greater reduction in expected sample size. Either interim-analysis strategy is more efficient than a single end-of-study analysis in terms of power per unit of sample size.
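For illustration only, the sketch below shows the general decision logic of a two-stage plan with a single interim analysis, using Fisher's product rule to combine the stage-wise p values at the final analysis. The boundary values a1 and a0 and the final threshold are made-up placeholders, not the plan studied in the paper; a real design would calibrate them so that the overall Type I error rate is controlled.

```python
import math
from scipy import stats

def two_stage_decision(p1, p2=None, a1=0.01, a0=0.50, alpha_final=0.05):
    """Illustrative two-stage sampling plan with one interim analysis.
    Boundaries a1 (early rejection) and a0 (early acceptance/futility)
    are placeholders; a real plan calibrates them to control the
    overall Type I error rate."""
    if p1 <= a1:
        return "reject at interim"
    if p1 >= a0:
        return "accept at interim"
    if p2 is None:
        return "continue to stage 2"
    # Combine the stage-wise p values with Fisher's product rule:
    # chi2 = -2 * sum(ln p_i) on 2k = 4 degrees of freedom.
    chi2 = -2.0 * (math.log(p1) + math.log(p2))
    p_combined = stats.chi2.sf(chi2, df=4)
    return "reject at final analysis" if p_combined <= alpha_final else "accept"

print(two_stage_decision(p1=0.20, p2=0.03))
```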

5.
Studies of graduate students learning to administer the Wechsler scales have generally shown that training is not associated with the development of scoring proficiency. Many studies report on the reduction of aggregated administration and scoring errors, a strategy that does not highlight the reduction of errors on subtests identified as most prone to error. This study evaluated the development of scoring proficiency specifically on the Wechsler (WISC-IV and WAIS-III) Vocabulary, Comprehension, and Similarities subtests during training by comparing a set of 'early test administrations' to 'later test administrations.' Twelve graduate students enrolled in an intelligence-testing course participated in the study. Scoring errors (e.g., incorrect point assignment) were evaluated on the students' actual practice administration test protocols. Errors on all three subtests declined significantly when scoring errors on 'early' sets of Wechsler scales were compared to those made on 'later' sets. However, correcting these subtest scoring errors did not cause significant changes in subtest scaled scores. Implications for clinical instruction and future research are discussed.

6.
A hybrid procedure for number-correct scoring is proposed. The proposed scoring procedure is based on both classical true-score theory (CTT) and multidimensional item response theory (MIRT). Specifically, the hybrid scoring procedure uses test item weights based on MIRT, while total test scores are computed based on CTT. Thus, what makes the hybrid scoring method attractive is that it accounts for the dimensionality of the test items while test scores remain easy to compute. Further, the hybrid scoring does not require large sample sizes once the item parameters are known. Monte Carlo techniques were used to compare and contrast the proposed hybrid scoring method with three other scoring procedures. Results indicated that all scoring methods in this study generated estimated and true scores that were highly correlated. However, the hybrid scoring procedure had significantly smaller error variances between the estimated and true scores relative to the other procedures.
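The abstract does not spell out the weighting formula, so the following Python sketch shows one plausible reading: each item's weight is taken from its multidimensional discrimination (the norm of its MIRT discrimination vector), and the total score is the CTT-style weighted number-correct sum. The function names mirt_weights and hybrid_score and the rescaling of the weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mirt_weights(a):
    """Illustrative item weights from a MIRT discrimination matrix `a`
    (items x dimensions): each item's multidimensional discrimination
    (Euclidean norm of its a-vector), rescaled to sum to the number of
    items. The weighting scheme is an assumption for illustration."""
    mdisc = np.sqrt((np.asarray(a, dtype=float) ** 2).sum(axis=1))
    return mdisc * len(mdisc) / mdisc.sum()

def hybrid_score(responses, weights):
    """CTT-style total: weighted sum of scored (0/1) item responses."""
    return float(np.asarray(responses, dtype=float) @ weights)

# Example: four items loading on two dimensions, one examinee's responses.
a = [[1.2, 0.1], [0.8, 0.9], [0.2, 1.4], [1.0, 1.0]]
print(hybrid_score([1, 0, 1, 1], mirt_weights(a)))
```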

7.
The important assumption of independent errors should be evaluated routinely in the application of interrupted time-series regression models. The two most frequently recommended tests of this assumption [Mood's runs test and the Durbin-Watson (D-W) bounds test] have several weaknesses. The former has poor small sample Type I error performance and the latter has the bothersome property that results are often declared to be "inconclusive." The test proposed in this article is simple to compute (special software is not required), there is no inconclusive region, an exact p-value is provided, and it has good Type I error and power properties relative to competing procedures. It is shown that these desirable properties hold when design matrices of a specified form are used to model the response variable. A Monte Carlo evaluation of the method, including comparisons with other tests (viz., runs, D-W bounds, and D-W beta), and examples of application are provided.

8.
A coaching strategy to decrease errors in swimming strokes with swimmers who had not improved under "standard" coaching procedures was investigated using a multiple baseline design across subjects and swimming strokes. The procedure resulted in a large decrease in errors on swimming strokes during sessions in a training pool. Stimulus generalization of improved performance to normal practice conditions in the regular pool was observed with all but one swimmer. This improvement was maintained during two maintenance phases lasting approximately 2 weeks, as well as under standard coaching conditions during at least a 2-week follow-up. For two swimmers, error rates on one of the strokes showed a gradual increase between the third and fifth week of follow-up, but brief remedial prompting sessions immediately corrected their performance. Some beneficial response generalization to other components of the stroke being trained was observed, but no improvements were found on untrained strokes. The error correction package did not disrupt practice, require excessive amounts of the coach's time, or necessitate the use of cumbersome apparatus. In addition, the coach and the swimmers considered the procedures to be effective, and expressed their willingness to participate in them again in the future.

9.
How do stimulus size and item number relate to the magnitude and direction of error on center estimation and line cancellation tests? How might this relationship inform theories concerning spatial neglect? These questions were addressed by testing twenty patients with right hemisphere lesions, eleven with left hemisphere lesions, and eleven normal control subjects on multiple versions of center estimation and line cancellation tests. Patients who made large errors on these tests also demonstrated an optimal or pivotal stimulus value, i.e., a particular size of center estimation stimulus or number of lines on cancellation that either minimized error magnitude relative to other stimulus sizes (optimal) or marked the boundary between normal and abnormal performance (pivotal). Patients with right hemisphere lesions made increasingly greater errors on the center estimation test as stimuli were both larger and smaller than the optimal value, whereas those with left hemisphere lesions made greater errors as stimuli were smaller than a pivotal value. In normal subjects, the direction of errors on center estimation stimuli shifted from the right of true center to the left as stimuli decreased in size (i.e., the crossover effect). Right hemisphere lesions exaggerated this effect, whereas left hemisphere lesions diminished and possibly reversed the direction of crossover. Error direction did not change as a function of stimulus value on cancellation tests. The demonstration of optimal and pivotal stimulus values indicates that performance on center estimation and cancellation tests in neglect is only relative to the stimuli used. In light of other studies, our findings indicate that patients with spatial neglect grossly overestimate the size of small stimuli and underestimate the size of large stimuli, that crossover represents an “apparent” shift in error direction that actually results from normally occurring errors in size perception, and that the left hemisphere is specialized for one aspect of size estimation, whereas the right performs dual roles.

10.
Speech samples from 40 nondemented English-speaking subjects, 20 with mild and 20 with moderate Parkinson's disease, were analyzed for voice onset time (VOT). Syntax comprehension and cognitive tests were administered to these subjects in the same test sessions. VOT disruptions for stop consonants in syllable-initial position, similar to those noted for Broca's aphasia, occurred for nine subjects. Longer response times and errors in the comprehension of syntax, as measured by the Rhode Island Test of Sentence Comprehension (RITLS), also occurred for these subjects. ANOVAs indicate that the VOT overlap subjects had significantly higher syntax error rates and longer response times on the RITLS than the VOT nonoverlap subjects, F(1, 70) = 12.38, p < 0.0008, and F(1, 70) = 7.70, p < 0.007, respectively. The correlation between the number of VOT timing errors and the number of syntax errors was significant (r = 0.6473, p < 0.01). VOT overlap subjects also had significantly higher error rates in cognitive tasks involving abstraction and the ability to maintain a mental set. Prefrontal cortex, acting through subcortical basal ganglia pathways, is a component of the neural substrate that regulates human speech production, syntactic ability, and certain aspects of cognition. The deterioration of these subcortical pathways may explain similar phenomena in Broca's aphasia. Results are discussed in relation to "modular" theories.

11.
The goal of this study was to investigate the performance of Hall’s transformation of the Brunner-Dette-Munk (BDM) and Welch-James (WJ) test statistics and Box-Cox’s data transformation in factorial designs when normality and variance homogeneity assumptions were violated separately and jointly. On the basis of unweighted marginal means, we performed a simulation study to explore the operating characteristics of the methods proposed for a variety of distributions with small sample sizes. Monte Carlo simulation results showed that when data were sampled from symmetric distributions, the error rates of the original BDM and WJ tests were scarcely affected by the lack of normality and homogeneity of variance. In contrast, when data were sampled from skewed distributions, the original BDM and WJ rates were not well controlled. Under such circumstances, the results clearly revealed that Hall’s transformation of the BDM and WJ tests provided generally better control of Type I error rates than did the same tests based on Box-Cox’s data transformation. Among all the methods considered in this study, we also found that Hall’s transformation of the BDM test yielded the best control of Type I errors, although it was often less powerful than either of the WJ tests when both approaches reasonably controlled the error rates.

12.
L. V. Jones and J. W. Tukey (2000) pointed out that the usual 2-sided, equal-tails null hypothesis test at level alpha can be reinterpreted as simultaneous tests of 2 directional inequality hypotheses, each at level alpha/2, and that the maximum probability of a Type I error is alpha/2 if the truth of the null hypothesis is considered impossible. This article points out that in multiple testing with familywise error rate controlled at alpha, the directional error rate (assuming all null hypotheses are false) is greater than alpha/2 and can be arbitrarily close to alpha. Single-step, step-down, and step-up procedures are analyzed, and other error rates, including the false discovery rate, are discussed. Implications for confidence interval estimation and hypothesis testing practices are considered.

13.
Inaccuracies in administration and scoring can potentially compromise the validity of any standardized psychosocial measure. The threat is particularly pertinent to methods involving behavioral observation, a category that includes many intelligence tests, neuropsychological measures, personality assessment instruments, and diagnostic procedures. Despite evidence and conjecture that errors in testing procedure are common for at least some of these measures and that these errors are often severe enough to influence interpretation, the topic has received relatively little attention. In particular, the absence of any safeguard against inaccurate test use in clinical situations can put the respondent at risk and violates ethical standards for the use of tests. In this article, I review some issues surrounding accuracy in testing procedures, including a discussion of what is known about the problem, an evaluation of several approaches to improving testing practices, and a review of recommendations for the statistical evaluation of rater accuracy. In this article, I use the Rorschach Comprehensive System (Exner, 1993) to demonstrate the concepts discussed.

14.
People’s knowledge about the world often contains misconceptions that are well-learned and firmly believed. Although such misconceptions seem hard to correct, recent research has demonstrated that errors made with higher confidence are more likely to be corrected with feedback, a finding called the hypercorrection effect. We investigated whether this effect persists over a 1-week delay. Subjects answered general-knowledge questions about science, rated their confidence in each response, and received correct answer feedback. Half of the subjects reanswered the same questions immediately, while the other half reanswered them after a 1-week delay. The hypercorrection effect occurred on both the immediate and delayed final tests, but error correction decreased on the delayed test. When subjects failed to correct an error on the delayed test, they sometimes reproduced the same error from the initial test. Interestingly, high-confidence errors were more likely than low-confidence errors to be reproduced on the delayed test. These findings help to contextualize the hypercorrection effect within the broader memory literature by showing that high-confidence errors are more likely to be corrected, but they are also more likely to be reproduced if the correct answer is forgotten.

15.
The purpose of this study was to evaluate a modified test of equivalence for conducting normative comparisons when distribution shapes are non-normal and variances are unequal. A Monte Carlo study was used to compare the empirical Type I error rates and power of the proposed Schuirmann–Yuen test of equivalence, which utilizes trimmed means, with that of the previously recommended Schuirmann and Schuirmann–Welch tests of equivalence when the assumptions of normality and variance homogeneity are satisfied, as well as when they are not satisfied. The empirical Type I error rates of the Schuirmann–Yuen were much closer to the nominal α level than those of the Schuirmann or Schuirmann–Welch tests, and the power of the Schuirmann–Yuen was substantially greater than that of the Schuirmann or Schuirmann–Welch tests when distributions were skewed or outliers were present. The Schuirmann–Yuen test is recommended for assessing clinical significance with normative comparisons.
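As a rough sketch of the procedure evaluated above, the code below implements two one-sided Yuen trimmed-means tests (the Schuirmann-style TOST logic) against equivalence bounds of ±bound. The 20% trimming default, function names, and example data are assumptions for illustration, not the authors' exact specification.

```python
import numpy as np
from scipy import stats

def yuen_t(x, y, trim=0.2, shift=0.0):
    """Yuen's trimmed-means t statistic for (trimmed mean of x -
    trimmed mean of y - shift), with Welch-type degrees of freedom."""
    def parts(a):
        a = np.sort(np.asarray(a, dtype=float))
        n = len(a)
        g = int(np.floor(trim * n))      # observations trimmed from each tail
        h = n - 2 * g                    # effective sample size
        tmean = a[g:n - g].mean()        # trimmed mean
        w = a.copy()                     # Winsorized sample
        w[:g] = a[g]
        w[n - g:] = a[n - g - 1]
        d = w.var(ddof=1) * (n - 1) / (h * (h - 1))
        return tmean, d, h
    mx, dx, hx = parts(x)
    my, dy, hy = parts(y)
    t = (mx - my - shift) / np.sqrt(dx + dy)
    df = (dx + dy) ** 2 / (dx ** 2 / (hx - 1) + dy ** 2 / (hy - 1))
    return t, df

def schuirmann_yuen(x, y, bound, trim=0.2, alpha=0.05):
    """Two one-sided Yuen tests: conclude equivalence if the group
    difference lies credibly inside (-bound, +bound)."""
    t_lo, df = yuen_t(x, y, trim, shift=-bound)   # H0: difference <= -bound
    t_hi, _ = yuen_t(x, y, trim, shift=+bound)    # H0: difference >= +bound
    p_lo = stats.t.sf(t_lo, df)
    p_hi = stats.t.cdf(t_hi, df)
    return max(p_lo, p_hi) < alpha

rng = np.random.default_rng(1)
print(schuirmann_yuen(rng.normal(0, 1, 40), rng.normal(0.1, 2, 40), bound=0.5))
```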

16.
A composite step-down procedure, in which a set of step-down tests are summarized collectively with Fisher's combination statistic, was considered to test for multivariate mean equality in two-group designs. An approximate degrees of freedom (ADF) composite procedure based on trimmed/Winsorized estimators and a non-pooled estimate of error variance is proposed, and compared to a composite procedure based on trimmed/Winsorized estimators and a pooled estimate of error variance. The step-down procedures were also compared to Hotelling's T² and Johansen's ADF global procedure based on trimmed estimators in a simulation study. Type I error rates of the pooled step-down procedure were sensitive to covariance heterogeneity in unbalanced designs; error rates were similar to those of Hotelling's T² across all of the investigated conditions. Type I error rates of the ADF composite step-down procedure were insensitive to covariance heterogeneity and less sensitive to the number of dependent variables when sample size was small than error rates of Johansen's test. The ADF composite step-down procedure is recommended for testing hypotheses of mean equality in two-group designs except when the data are sampled from populations with different degrees of multivariate skewness.
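For reference, Fisher's combination statistic summarizes k independent p values as chi-square = -2 * sum(ln p_i) on 2k degrees of freedom. The snippet below is a minimal illustration using SciPy; the individual step-down p values are made up for the example.

```python
from scipy import stats

# Hypothetical p values from a set of step-down tests on the dependent variables.
p_values = [0.04, 0.19, 0.008]

# Fisher's combination: chi2 = -2 * sum(ln p_i) on 2k degrees of freedom.
chi2, p_combined = stats.combine_pvalues(p_values, method="fisher")
print(chi2, p_combined)
```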

17.
Pan T, Yin Y. Psychological Methods, 2012, 17(2): 309-311.
In the discussion of mean square difference (MSD) and standard error of measurement (SEM), Barchard (2012) concluded that the MSD between 2 sets of test scores is greater than 2(SEM)² and SEM underestimates the score difference between 2 tests when the 2 tests are not parallel. This conclusion has limitations for 2 reasons. First, strictly speaking, MSD should not be compared to SEM because they measure different things, have different assumptions, and capture different sources of errors. Second, the related proof and conclusions in Barchard hold only under the assumptions of equal reliabilities, homogeneous variances, and independent measurement errors. To address the limitations, we propose that MSD should be compared to the standard error of measurement of difference scores (SEMx−y) so that the comparison can be extended to the conditions when 2 tests have unequal reliabilities and score variances.
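The comparison quantity the authors propose can be written out directly: under classical true-score theory, SEM = SD * sqrt(1 - reliability) for each test, and with independent measurement errors the SEM of difference scores is SEMx−y = sqrt(SEMx² + SEMy²). The snippet below is a small sketch of that computation; the numeric values are illustrative only.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement for a single test (CTT)."""
    return sd * math.sqrt(1.0 - reliability)

def sem_difference(sd_x, rel_x, sd_y, rel_y):
    """SEM of difference scores, assuming independent measurement errors."""
    return math.sqrt(sem(sd_x, rel_x) ** 2 + sem(sd_y, rel_y) ** 2)

# Illustrative values: two tests with unequal reliabilities and variances.
print(sem_difference(sd_x=15.0, rel_x=0.90, sd_y=12.0, rel_y=0.80))
```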

18.
This exploratory study investigated the relationship between two measures of perceptual-motor problems. Ratios derived from the Object Assembly subtest of the WISC were compared with errors greater than 1 SD on the Bender test, scored using Koppitz' system. The sample comprised 66 white middle-class children aged 6 to 11 years. The particular ratio studied was statistically significantly associated with the Bender, whereas ratios of other Object Assembly items were not. The Horse ratio was also significantly associated with the presence of learning problems shown in children's files.

19.
The use of objectively validated projective tests in personnel decisions has been limited in recent years because of the perception that such tests are highly subjective, difficult to administer, and difficult to score in a reliable manner. The present paper demonstrates the use of a brief (½ hour) projective test battery consisting of the Bender–Gestalt, House–Tree–Person, and a free drawing test which can be administered in a personnel office and scored blindly using an objective scoring system. The study showed that such a battery could predict six-month retention rates in a sample of recently hired corrections officers at statistically significant rates (χ² = 6.25, p < 0.05) despite the fact that the individuals had already been thoroughly prescreened using the company's comprehensive normal procedures. The possible uses and advantages of a language-free projective battery are discussed along with future research directions.

20.
In three experiments, undergraduate subjects were asked to evaluate or choose between hypothetical medical tests. Subjects were told the subjective prior probability of a hypothetical disease, the hit rate and false-alarm rate of each test, and the relative subjective cost of the two possible errors that might be made. By varying priors, cost, and test accuracy, we could measure the influence of each parameter on subjects' responses. Subjects overweighted costs relative to both priors and test accuracy. In single-test cases in which the choice was whether to test or do something else (treat or withhold treatment), priors were not systematically misweighted relative to accuracy. When two tests were compared, priors were underweighted relative to accuracy. Justifications agreed with the conclusions reached by analysis of the preferences. When evaluating a test, subjects do not seem to understand that high priors make hit rates more relevant, while low priors make false-alarm rates more relevant. Subjects do, however, understand that a large cost of not treating diseased patients makes hit rates more relevant, while a large cost for treating nondiseased patients makes false-alarm rates more relevant. The overweighting of costs seems to result from the use of a heuristic in which the subject tends to minimize the probability of the worst kind of error, regardless of other parameters.
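A normative benchmark against which the subjects' heuristic can be compared is a simple expected-cost calculation over the two error types. The sketch below is one minimal formalization; the parameter names and numbers are illustrative and are not taken from the experiments.

```python
def expected_cost(prior, hit_rate, fa_rate, cost_fn, cost_fp):
    """Expected cost of using a test and acting on its result:
    misses (false negatives) occur with P(disease) * (1 - hit rate),
    false alarms with P(no disease) * false-alarm rate."""
    return prior * (1 - hit_rate) * cost_fn + (1 - prior) * fa_rate * cost_fp

# Compare two hypothetical tests under a low prior: here the false-alarm
# rate matters more than the hit rate, so the less sensitive but more
# specific test has the lower expected cost.
test_a = expected_cost(prior=0.05, hit_rate=0.95, fa_rate=0.20, cost_fn=10, cost_fp=1)
test_b = expected_cost(prior=0.05, hit_rate=0.80, fa_rate=0.05, cost_fn=10, cost_fp=1)
print(test_a, test_b)
```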
