Similar Articles
20 similar articles found.
1.
According to Wollack and Schoenig (2018, The Sage encyclopedia of educational research, measurement, and evaluation. Thousand Oaks, CA: Sage, 260), benefiting from item preknowledge is one of the three broad types of test fraud that occur in educational assessments. We use tools from constrained statistical inference to suggest a new statistic that is based on item scores and response times and can be used to detect examinees who may have benefited from item preknowledge for the case when the set of compromised items is known. The asymptotic distribution of the new statistic under no preknowledge is proved to be a simple mixture of two χ2 distributions. We perform a detailed simulation study to show that the Type I error rate of the new statistic is very close to the nominal level and that the power of the new statistic is satisfactory in comparison to that of the existing statistics for detecting item preknowledge based on both item scores and response times. We also include a real data example to demonstrate the usefulness of the suggested statistic.
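Under constrained inference, null distributions of this kind are mixtures of χ2 components; the sketch below shows, purely for illustration, how a p-value would be obtained from a two-component mixture. The equal 0.5/0.5 weights and the degrees of freedom 1 and 2 are assumptions made for the example, not the values derived in the paper.

```python
# A minimal sketch: p-value from a two-component chi-square mixture null.
# The 0.5/0.5 weights and the degrees of freedom (1 and 2) are illustrative
# assumptions, not the values derived in the paper.
from scipy.stats import chi2

def mixture_p_value(stat, weights=(0.5, 0.5), dfs=(1, 2)):
    """P(T >= stat) when T is a weighted mixture of chi-square distributions."""
    return sum(w * chi2.sf(stat, df) for w, df in zip(weights, dfs))

print(mixture_p_value(5.3))  # e.g. flag an examinee if this falls below alpha
```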

2.
Adapting Edgington's [J. Psychol. 90 (1975) 57] randomly determined intervention start-point model, Levin and Wampold [Sch. Psychol. Quart. 14 (1999) 59] proposed a set of nonparametric randomization tests for analyzing the data from single-case designs. In the present study, the performance of Levin and Wampold's four basic tests (independent start-point general and comparative effectiveness, simultaneous start-point general and comparative effectiveness) was examined with respect to their Type I error rates and statistical power. Of Levin and Wampold's four tests, all except the independent start-point comparative effectiveness test maintained their empirical Type I error rates and had acceptable power at larger sample-size and effect-size combinations. The one-tailed comparative intervention effectiveness test for the independent start-point model was found to be too liberal, in that it did not maintain its Type I error rate. Although a two-tailed application of that test was found to be conservative at longer series lengths, it had acceptable power at larger sample-size and effect-size combinations. The results support the utility of a versatile new class of single-case designs that permit both within- and between-unit statistical assessments of intervention effectiveness.
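For readers unfamiliar with the randomly determined start-point idea, the following is a minimal sketch of a one-tailed general-effectiveness randomization test for a single unit, using a simple mean-difference statistic and hypothetical data; the specific statistics and multi-unit extensions studied by Levin and Wampold are not reproduced here.

```python
# A minimal sketch of a random intervention start-point randomization test in
# the spirit of Edgington (1975): the observed statistic is referred to the
# distribution obtained by recomputing it at every potential start point.
# The mean-difference statistic and the data are illustrative choices.
import numpy as np

def start_point_randomization_test(series, potential_starts, actual_start):
    series = np.asarray(series, dtype=float)

    def phase_mean_diff(start):
        return series[start:].mean() - series[:start].mean()

    observed = phase_mean_diff(actual_start)
    null_dist = np.array([phase_mean_diff(s) for s in potential_starts])
    # One-tailed p-value: proportion of admissible start points whose
    # statistic is at least as large as the observed one.
    return np.mean(null_dist >= observed)

data = [3, 2, 4, 3, 5, 8, 9, 7, 8, 9]          # hypothetical single-case series
p = start_point_randomization_test(data, potential_starts=range(3, 8), actual_start=5)
print(p)
```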

3.
Student's one-sample t-test is a commonly used method when inference about the population mean is made. As advocated in textbooks and articles, the assumption of normality is often checked by a preliminary goodness-of-fit (GOF) test. In a paper recently published by Schucany and Ng it was shown that, for the uniform distribution, screening of samples by a pretest for normality leads to a more conservative conditional Type I error rate than application of the one-sample t-test without preliminary GOF test. In contrast, for the exponential distribution, the conditional level is even more elevated than the Type I error rate of the t-test without pretest. We examine the reasons behind these characteristics. In a simulation study, samples drawn from the exponential, lognormal, uniform, Student's t-distribution with 2 degrees of freedom (t(2)) and the standard normal distribution that had passed normality screening, as well as the ingredients of the test statistics calculated from these samples, are investigated. For non-normal distributions, we found that preliminary testing for normality may change the distribution of means and standard deviations of the selected samples as well as the correlation between them (if the underlying distribution is non-symmetric), thus leading to altered distributions of the resulting test statistics. It is shown that for skewed distributions the excess in Type I error rate may be even more pronounced when testing one-sided hypotheses.
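A minimal sketch of the two-stage procedure under study, assuming a Shapiro-Wilk pretest (an illustrative choice of GOF test) and exponential data with a true null, to estimate the conditional Type I error rate among samples that pass the screen:

```python
# A minimal sketch: draw samples from a non-normal distribution under a true
# null, screen them with a normality pretest (Shapiro-Wilk here, as an
# illustrative choice), and estimate the conditional Type I error rate of the
# one-sample t-test among the samples that pass the screen.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, alpha, reps = 20, 0.05, 20000
rejections, passed = 0, 0

for _ in range(reps):
    x = rng.exponential(scale=1.0, size=n)        # true population mean is 1.0
    if stats.shapiro(x).pvalue > alpha:           # sample passes the pretest
        passed += 1
        if stats.ttest_1samp(x, popmean=1.0).pvalue < alpha:
            rejections += 1

print(rejections / passed)  # conditional Type I error rate after screening
```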

4.
We study several aspects of bootstrap inference for covariance structure models based on three test statistics, including Type I error, power and sample‐size determination. Specifically, we discuss conditions for a test statistic to achieve a more accurate level of Type I error, both in theory and in practice. Details on power analysis and sample‐size determination are given. For data sets with heavy tails, we propose applying a bootstrap methodology to a transformed sample by a downweighting procedure. One of the key conditions for safe bootstrap inference is generally satisfied by the transformed sample but may not be satisfied by the original sample with heavy tails. Several data sets illustrate that, by combining downweighting and bootstrapping, a researcher may find a nearly optimal procedure for evaluating various aspects of covariance structure models. A rule for handling non‐convergence problems in bootstrap replications is proposed.

5.
Experiments often produce a hit rate and a false alarm rate in each of two conditions. These response rates are summarized into a single-point sensitivity measure such as d', and t tests are conducted to test for experimental effects. Using large-scale Monte Carlo simulations, we evaluate the Type I error rates and power that result from four commonly used single-point measures: d', A', percent correct, and gamma. We also test a newly proposed measure called gammaC. For all measures, we consider several ways of handling cases in which false alarm rate = 0 or hit rate = 1. The results of our simulations indicate that power is similar for these measures but that the Type I error rates are often unacceptably high. Type I errors are minimized when the selected sensitivity measure is theoretically appropriate for the data.
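For reference, the sketch below computes four of the single-point measures from one hit/false-alarm pair. The log-linear correction used to avoid hit rate = 1 or false alarm rate = 0 is one of several handling strategies and is an illustrative choice here, and the percent-correct formula assumes equal numbers of signal and noise trials.

```python
# A minimal sketch of single-point sensitivity measures computed from a hit
# rate (h) and a false alarm rate (f). The 0.5 log-linear count correction is
# an illustrative way of avoiding rates of exactly 0 or 1.
from scipy.stats import norm

def loglinear_rates(hits, misses, false_alarms, correct_rejections):
    """Return (h, f) with 0.5 added to each cell to avoid rates of 0 and 1."""
    h = (hits + 0.5) / (hits + misses + 1.0)
    f = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return h, f

def d_prime(h, f):
    return norm.ppf(h) - norm.ppf(f)

def a_prime(h, f):
    if h >= f:
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))

def gamma(h, f):
    return (h - f) / (h + f - 2 * h * f)

def percent_correct(h, f):
    return (h + (1 - f)) / 2  # assumes equal numbers of signal and noise trials

h, f = loglinear_rates(hits=18, misses=2, false_alarms=0, correct_rejections=20)
print(d_prime(h, f), a_prime(h, f), gamma(h, f), percent_correct(h, f))
```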

6.
Most dichotomous item response models share the assumption of latent monotonicity, which states that the probability of a positive response to an item is a nondecreasing function of a latent variable intended to be measured. Latent monotonicity cannot be evaluated directly, but it implies manifest monotonicity across a variety of observed scores, such as the restscore, a single item score, and in some cases the total score. In this study, we show that manifest monotonicity can be tested by means of the order-constrained statistical inference framework. We propose a procedure that uses this framework to determine whether manifest monotonicity should be rejected for specific items. This approach provides a likelihood ratio test for which the p-value can be approximated through simulation. A simulation study is presented that evaluates the Type I error rate and power of the test, and the procedure is applied to empirical data.
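As a point of reference, the sketch below computes the quantity being tested: an item's proportion of positive responses across restscore groups, flagging any decreases. It is only a descriptive check with hypothetical data, not the order-constrained likelihood ratio test with simulated p-values proposed in the study.

```python
# A minimal descriptive sketch of manifest monotonicity: the proportion of
# positive responses to an item should be nondecreasing across restscore
# groups. This only illustrates the property under test, not the paper's
# order-constrained likelihood ratio test.
import numpy as np

def item_restscore_curve(responses, item):
    """responses: persons x items matrix of 0/1 scores."""
    responses = np.asarray(responses)
    restscore = responses.sum(axis=1) - responses[:, item]
    groups = np.unique(restscore)
    return [(g, responses[restscore == g, item].mean()) for g in groups]

rng = np.random.default_rng(0)
theta = rng.normal(size=(500, 1))                            # latent trait
prob = 1 / (1 + np.exp(-(theta - np.linspace(-1, 1, 10))))   # Rasch-like probabilities
data = (rng.random((500, 10)) < prob).astype(int)            # hypothetical 0/1 scores

curve = item_restscore_curve(data, item=0)
violations = [(g1, g2) for (g1, p1), (g2, p2) in zip(curve, curve[1:]) if p2 < p1]
print(curve)
print("possible monotonicity violations between restscores:", violations)
```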

7.
In a recent article in The Journal of General Psychology, J. B. Hittner, K. May, and N. C. Silver (2003) described their investigation of several methods for comparing dependent correlations and found that all can be unsatisfactory, in terms of Type I errors, even with a sample size of 300. More precisely, when researchers test at the .05 level, the actual Type I error probability can exceed .10. The authors of this article extended J. B. Hittner et al.'s research by considering a variety of alternative methods. They found 3 that avoid inflating the Type I error rate above the nominal level. However, a Monte Carlo simulation demonstrated that when the underlying distribution of scores violated the assumption of normality, 2 of these methods had relatively low power and had actual Type I error rates well below the nominal level. The authors report comparisons with E. J. Williams' (1959) method.

8.
We examine methods for measuring performance in signal-detection-like tasks when each participant provides only a few observations. Monte Carlo simulations demonstrate that standard statistical techniques applied to a d′ analysis can lead to large numbers of Type I errors (incorrectly rejecting a hypothesis of no difference). Various statistical methods were compared in terms of their Type I and Type II error (incorrectly accepting a hypothesis of no difference) rates. Our conclusions are the same whether these two types of errors are weighted equally or Type I errors are weighted more heavily. The most promising method is to combine an aggregate d′ measure with a percentile bootstrap confidence interval, a computer-intensive nonparametric method of statistical inference. Researchers who prefer statistical techniques more commonly used in psychology, such as a repeated measures t test, should use γ (Goodman & Kruskal, 1954), since it performs slightly better than or nearly as well as d′. In general, when repeated measures t tests are used, γ is more conservative than d′: it makes more Type II errors, but its Type I error rate tends to be much closer to that of the traditional .05 α level. It is somewhat surprising that γ performs as well as it does, given that the simulations that generated the hypothetical data conformed completely to the d′ model. Analyses in which H − FA was used had the highest Type I error rates. Detailed simulation results can be downloaded from www.psychonomic.org/archive/Schooler-BRM-2004.zip.
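A minimal sketch of the favoured combination, aggregate d′ with a percentile bootstrap interval, resampling participants for a two-condition comparison; the data layout, the 0.5 count correction, and all names are illustrative assumptions rather than the authors' exact procedure.

```python
# A minimal sketch: compute d' from counts pooled over participants in each
# condition and form a percentile bootstrap confidence interval for the
# difference by resampling participants. Data layout and correction are
# illustrative assumptions.
import numpy as np
from scipy.stats import norm

def aggregate_d_prime(hits, signal_n, fas, noise_n):
    """d' from counts pooled over participants, with a 0.5 count correction."""
    h = (hits.sum() + 0.5) / (signal_n.sum() + 1.0)
    f = (fas.sum() + 0.5) / (noise_n.sum() + 1.0)
    return norm.ppf(h) - norm.ppf(f)

def bootstrap_dprime_diff(cond_a, cond_b, n_boot=2000, seed=0):
    """cond_a, cond_b: dicts of per-participant count arrays (same participants)."""
    rng = np.random.default_rng(seed)
    n = len(cond_a["hits"])
    keys = ("hits", "signal_n", "fas", "noise_n")
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample participants
        da = aggregate_d_prime(*(cond_a[k][idx] for k in keys))
        db = aggregate_d_prime(*(cond_b[k][idx] for k in keys))
        diffs.append(da - db)
    return np.percentile(diffs, [2.5, 97.5])       # 95% percentile interval

# Hypothetical data: 10 participants, 8 signal and 8 noise trials each.
rng = np.random.default_rng(1)
make = lambda p_hit, p_fa: {
    "hits": rng.binomial(8, p_hit, 10), "signal_n": np.full(10, 8),
    "fas": rng.binomial(8, p_fa, 10), "noise_n": np.full(10, 8),
}
print(bootstrap_dprime_diff(make(0.8, 0.2), make(0.7, 0.25)))
```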

9.
Based on improved Wald statistics, a differential item functioning (DIF) detection method designed for two groups is extended to DIF testing with multiple groups. The improved Wald statistics are obtained by computing the observed information matrix (Obs) and the empirical cross-product information matrix (XPD), respectively. A simulation study compared these two approaches with the traditional computation for multiple groups. Results show that (1) the Type I error rates of Obs and XPD are clearly lower than those of the traditional method, and under DINA model estimation the Type I error rates of Obs and XPD are close to the nominal level; and (2) when the sample size and the amount of DIF are large, Obs and XPD have roughly the same statistical power as the traditional Wald statistic.

10.
It is well known that when data are nonnormally distributed, a test of the significance of Pearson's r may inflate Type I error rates and reduce power. Statistics textbooks and the simulation literature provide several alternatives to Pearson's correlation. However, the relative performance of these alternatives has been unclear. Two simulation studies were conducted to compare 12 methods, including Pearson, Spearman's rank-order, transformation, and resampling approaches. With most sample sizes (n ≥ 20), Type I and Type II error rates were minimized by transforming the data to a normal shape prior to assessing the Pearson correlation. Among transformation approaches, a general purpose rank-based inverse normal transformation (i.e., transformation to rankit scores) was most beneficial. However, when samples were both small (n ≤ 10) and extremely nonnormal, the permutation test often outperformed other alternatives, including various bootstrap tests.
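A minimal sketch of the best-performing approach for n ≥ 20: apply a rank-based inverse normal (rankit) transformation to each variable and then test the Pearson correlation of the transformed scores. The (rank − 0.5)/n rankit form and the example data are assumptions made for illustration.

```python
# A minimal sketch of a rank-based inverse normal (rankit) transformation
# followed by a Pearson correlation test on the transformed scores.
import numpy as np
from scipy import stats

def rankit(x):
    """Rank-based inverse normal transform: Phi^{-1}((rank - 0.5) / n)."""
    ranks = stats.rankdata(x)
    return stats.norm.ppf((ranks - 0.5) / len(x))

rng = np.random.default_rng(0)
x = rng.lognormal(size=50)                      # hypothetical skewed variables
y = 0.4 * x + rng.lognormal(size=50)
r, p = stats.pearsonr(rankit(x), rankit(y))
print(r, p)
```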

11.
Variable Error     
The degree to which blocked variable error (VE) data satisfies the assumptions of compound symmetry required for a repeated measures ANOVA was studied. Monte Carlo procedures were used to study the effect of violation of this assumption, under varying block sizes, on the Type I error rate. Populations of 10,000 subjects for each of two groups, the underlying variance-covariance matrices reflecting a specific condition of violation of the homogeneity of covariance assumptions, were generated based on each of three actual experimental data sets. The data were blocked in various ways, VE calculated, and subsequently analyzed by a repeated measures ANOVA. The complete process was replicated for four covariance homogeneity conditions for each of the three data sets, resulting in a total of 22,000 simulated experiments. Results indicated that the Type I error rate increases as the degree of heterogeneity within the variance-covariance matrices increases when raw (unblocked) data are analyzed. With VE, the effects of within-matrix heterogeneity on the Type I error rate are inconclusive. However, block size does seem to affect the probability of obtaining a significant interaction, but the nature of this relationship is not clear as there does not appear to be any consistent relationship between the size of the block and the probability of obtaining significance. For both raw and VE data there was no inflation in the number of Type I errors when the covariances within a given matrix were homogeneous, regardless of the differences between the group variance-covariance matrices.

12.
The Type I error probability and the power of the independent samples t test, performed directly on the ranks of scores in combined samples in place of the original scores, are known to be the same as those of the non-parametric Wilcoxon–Mann–Whitney (WMW) test. In the present study, simulations revealed that these probabilities remain essentially unchanged when the number of ranks is reduced by assigning the same rank to multiple ordered scores. For example, if 200 ranks are reduced to as few as 20, or 10, or 5 ranks by replacing sequences of consecutive ranks by a single number, the Type I error probability and power stay about the same. Significance tests performed on these modular ranks consistently reproduce familiar findings about the comparative power of the t test and the WMW tests for normal and various non-normal distributions. Similar results are obtained for modular ranks used in comparing the one-sample t test and the Wilcoxon signed ranks test.
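A minimal sketch of the modular-rank idea: rank the combined samples, collapse runs of consecutive ranks to a small set of values, and apply the ordinary independent-samples t test. The collapsing rule (ceiling of rank over block size) is one natural reading of the description and is an illustrative choice.

```python
# A minimal sketch: reduce the ranks of two combined samples to a small number
# of "modular" ranks and run an independent-samples t test on those values.
import numpy as np
from scipy import stats

def modular_ranks(x, y, n_levels):
    combined = np.concatenate([x, y])
    ranks = stats.rankdata(combined)
    block = len(combined) / n_levels
    mod = np.ceil(ranks / block)                 # e.g. 200 ranks -> n_levels values
    return mod[: len(x)], mod[len(x):]

rng = np.random.default_rng(0)
x, y = rng.exponential(size=100), rng.exponential(size=100) + 0.3
mx, my = modular_ranks(x, y, n_levels=10)
print(stats.ttest_ind(mx, my))
```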

13.
14.
The purpose of the present study was to investigate the statistical properties of two extensions of the Levin-Wampold (1999) single-case simultaneous start-point model's comparative effectiveness randomization test. The two extensions were (a) adapting the test to situations where there are more than two different intervention conditions and (b) examining the test's performance in classroom-based intervention situations, where the number of time periods (and associated outcome observations) is much smaller than in the contexts for which the test was originally developed. Various Monte Carlo sampling situations were investigated, including from one to five participant blocks per condition and differing numbers of time periods, potential intervention start points, degrees of within-phase autocorrelation, and effect sizes. For all situations, it was found that the Type I error probability of the randomization test was maintained at an acceptable level. With a few notable exceptions, respectable power was observed only in situations where the numbers of observations and potential intervention start points were relatively large, effect sizes were large, and the degree of within-phase autocorrelation was relatively low. It was concluded that the comparative effectiveness randomization test, with its desirable internal validity and statistical-conclusion validity features, is a versatile analytic tool that can be incorporated into a variety of single-case school psychology intervention research situations.

15.
In simulation studies, the F test for differences in regression slopes has tended to distort nominal Type I and II error rates when the 2 subgroup error variances exceeded a 1.50:1 ratio. This study examines the frequency and extent that this ratio is violated within data sets relevant to applied psychology. The General Aptitude Test Battery (GATB) validity study database contained ability data and overall job performance ratings. The Project A military database contained both ability and personality data, along with job performance factor scores and an overall job performance rating. Results suggest that subgroup (White-Black, male-female) error variances are often homogeneous enough to support F test results from past empirical work. Enough heterogeneity was found, however, to urge applied psychologists investigating differential prediction to explore their data and consider the possibility of alternative statistical tests.

16.
Adverse impact evaluations often call for evidence that the disparity between groups in selection rates is statistically significant, and practitioners must choose which test statistic to apply in this situation. To identify the most effective testing procedure, the authors compared several alternate test statistics in terms of Type I error rates and power, focusing on situations with small samples. Significance testing was found to be of limited value because of low power for all tests. Among the alternate test statistics, the widely-used Z-test on the difference between two proportions performed reasonably well, except when sample size was extremely small. A test suggested by G. J. G. Upton (1982) provided slightly better control of Type I error under some conditions but generally produced results similar to the Z-test. Use of the Fisher Exact Test and Yates's continuity-corrected chi-square test are not recommended because of overly conservative Type I error rates and substantially lower power than the Z-test.
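For reference, a minimal sketch of the widely used Z-test on the difference between two selection rates, in its pooled-variance form, with hypothetical counts:

```python
# A minimal sketch of the pooled two-proportion Z test applied to selection
# rates in two groups. The counts are hypothetical.
import numpy as np
from scipy.stats import norm

def two_proportion_z(selected_1, n_1, selected_2, n_2):
    p1, p2 = selected_1 / n_1, selected_2 / n_2
    pooled = (selected_1 + selected_2) / (n_1 + n_2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))                # two-sided p-value

print(two_proportion_z(selected_1=12, n_1=40, selected_2=5, n_2=35))
```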

17.
We discuss the statistical testing of three relevant hypotheses involving Cronbach's alpha: one where alpha equals a particular criterion; a second testing the equality of two alpha coefficients for independent samples; and a third testing the equality of two alpha coefficients for dependent samples. For each of these hypotheses, various statistical tests have been proposed. Over the years, these tests have depended on progressively fewer assumptions. We propose a new approach to testing the three hypotheses that relies on even fewer assumptions, is especially suited for discrete item scores, and can be applied easily to tests containing large numbers of items. The new approach uses marginal modelling. We compared the Type I error rate and the power of the marginal modelling approach to several of the available tests in a simulation study using realistic conditions. We found that the marginal modelling approach had the most accurate Type I error rates, whereas the power was similar across the statistical tests.
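The marginal-modelling approach itself is not reproduced here; as a familiar baseline for the first hypothesis (alpha equals a fixed criterion), the sketch below applies the classical Feldt (1965) F test with hypothetical data. The degrees of freedom and the orientation of the statistic follow the standard presentation of that test, not the paper's method, and should be treated as an assumption of this example.

```python
# A minimal sketch of Feldt's (1965) test of H0: alpha = alpha0, shown only as
# a classical baseline; it is not the marginal-modelling approach the paper
# proposes. Under H0, (1 - alpha0) / (1 - alpha_hat) is taken here to follow
# an F distribution with (n - 1) and (n - 1)(k - 1) degrees of freedom.
import numpy as np
from scipy.stats import f

def cronbach_alpha(scores):
    """scores: persons x items matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def feldt_test(scores, alpha0):
    n, k = scores.shape
    a_hat = cronbach_alpha(scores)
    w = (1 - alpha0) / (1 - a_hat)               # assumed ~ F(n-1, (n-1)(k-1)) under H0
    p_upper = f.sf(w, n - 1, (n - 1) * (k - 1))  # evidence that alpha > alpha0
    return a_hat, w, p_upper

rng = np.random.default_rng(0)
true_score = rng.normal(size=(200, 1))
data = true_score + rng.normal(scale=1.0, size=(200, 6))   # hypothetical 6-item test
print(feldt_test(data, alpha0=0.70))
```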

18.
When more than one significance test is carried out on data from a single experiment, researchers are often concerned with the probability of one or more Type I errors over the entire set of tests. This article considers several methods of exercising control over that probability (the so-called family-wise Type I error rate), provides a schematic that can be used by a researcher to choose among the methods, and discusses applications to contingency tables.
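As one concrete example of family-wise control, the sketch below applies the Holm step-down procedure to a set of hypothetical p-values from a single experiment; Holm's method is chosen for illustration and is not necessarily the method the article recommends for contingency tables.

```python
# A minimal sketch of the Holm step-down procedure for controlling the
# family-wise Type I error rate across a set of significance tests.
import numpy as np

def holm(p_values, alpha=0.05):
    """Return a boolean rejection decision for each hypothesis."""
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)
    reject = np.zeros(len(p), dtype=bool)
    for step, idx in enumerate(order):
        if p[idx] <= alpha / (len(p) - step):
            reject[idx] = True
        else:
            break                                 # all larger p-values also fail
    return reject

print(holm([0.003, 0.012, 0.041, 0.30]))          # e.g. [True, True, False, False]
```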

19.
Categorical moderators are often included in mixed-effects meta-analysis to explain heterogeneity in effect sizes. An assumption in tests of categorical moderator effects is that of a constant between-study variance across all levels of the moderator. Although it rarely receives serious thought, there can be statistical ramifications to upholding this assumption. We propose that researchers should instead default to assuming unequal between-study variances when analysing categorical moderators. To achieve this, we suggest using a mixed-effects location-scale model (MELSM) to allow group-specific estimates for the between-study variance. In two extensive simulation studies, we show that in terms of Type I error and statistical power, little is lost by using the MELSM for moderator tests, but there can be serious costs when an equal variance mixed-effects model (MEM) is used. Most notably, in scenarios with balanced sample sizes or equal between-study variance, the Type I error and power rates are nearly identical between the MEM and the MELSM. On the other hand, with imbalanced sample sizes and unequal variances, the Type I error rate under the MEM can be grossly inflated or overly conservative, whereas the MELSM does comparatively well in controlling the Type I error across the majority of cases. A notable exception where the MELSM did not clearly outperform the MEM was in the case of few studies (e.g., 5). With respect to power, the MELSM had similar or higher power than the MEM in conditions where the latter produced non-inflated Type I error rates. Together, our results support the idea that assuming unequal between-study variances is preferred as a default strategy when testing categorical moderators.

20.
A person fit test based on the Lagrange multiplier test is presented for three item response theory models for polytomous items: the generalized partial credit model, the sequential model, and the graded response model. The test can also be used in the framework of multidimensional ability parameters. It is shown that the Lagrange multiplier statistic can take both the effects of estimation of the item parameters and the estimation of the person parameters into account. The Lagrange multiplier statistic has an asymptotic χ2-distribution. The Type I error rate and power are investigated using simulation studies. Results show that test statistics that ignore the effects of estimation of the persons' ability parameters have decreased Type I error rates and power. Incorporating a correction to account for the effects of the estimation of the persons' ability parameters results in acceptable Type I error rates and power characteristics; incorporating a correction for the estimation of the item parameters has very little additional effect. It is investigated to what extent the three models give comparable results, both in the simulation studies and in an example using data from the NEO Personality Inventory-Revised.
