Similar Articles
20 similar articles found
1.
Psychology is undergoing a replication crisis. The discussion surrounding this crisis has centered on mistrust of previous findings. Researchers planning replication studies often use the original study sample effect size as the basis for sample size planning. However, this strategy ignores uncertainty and publication bias in estimated effect sizes, resulting in overly optimistic calculations. A psychologist who intends to obtain power of .80 in the replication study, and performs calculations accordingly, may have an actual power lower than .80. We performed simulations to reveal the magnitude of the difference between actual and intended power based on common sample size planning strategies and assessed the performance of methods that aim to correct for effect size uncertainty and/or bias. Our results imply that even if original studies reflect actual phenomena and were conducted in the absence of questionable research practices, popular approaches to designing replication studies may result in a low success rate, especially if the original study is underpowered. Methods correcting for bias and/or uncertainty generally had higher actual power, but were not a panacea for an underpowered original study. Thus, it becomes imperative that 1) original studies are adequately powered and 2) replication studies are designed with methods that are more likely to yield the intended level of power.
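A minimal sketch of the phenomenon the abstract describes (assumptions of ours, not the authors' code): a two-sample t-test design where the replication is sized by taking the original study's observed d at face value, with a crude selection filter standing in for publication bias.

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(0)
true_d, n_orig = 0.3, 30          # an underpowered original study
solver = TTestIndPower()
actual = []
for _ in range(1000):
    # Original study: the observed d contains sampling error.
    g1, g2 = rng.normal(true_d, 1, n_orig), rng.normal(0, 1, n_orig)
    sd = np.sqrt((g1.var(ddof=1) + g2.var(ddof=1)) / 2)
    d_obs = (g1.mean() - g2.mean()) / sd
    if d_obs <= 0.1:
        continue  # crude stand-in for selection on "successful" originals
    # Plan the replication as if d_obs were the true effect size ...
    n_rep = solver.solve_power(effect_size=d_obs, alpha=0.05, power=0.80)
    # ... but achieved power is governed by the true effect size.
    actual.append(solver.power(effect_size=true_d, nobs1=n_rep, alpha=0.05))
print(f"intended power: 0.80, mean actual power: {np.mean(actual):.2f}")
```

Because the observed d overestimates the true effect whenever the original study "succeeded", the average achieved power falls well short of the intended .80.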

2.
Bonett DG, Psychological Methods, 2008, 13(2): 99-109
Most psychology journals now require authors to report a sample value of effect size along with hypothesis testing results. The sample effect size value can be misleading because it contains sampling error. Authors often incorrectly interpret the sample effect size as if it were the population effect size. A simple solution to this problem is to report a confidence interval for the population value of the effect size. Standardized linear contrasts of means are useful measures of effect size in a wide variety of research applications. New confidence intervals for standardized linear contrasts of means are developed and may be applied to between-subjects designs, within-subjects designs, or mixed designs. The proposed confidence interval methods are easy to compute, do not require equal population variances, and perform better than the currently available methods when the population variances are not equal.
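For illustration only, a generic sketch of the estimand: a linear contrast of means standardized by the average of the group variances (so equal variances are not assumed), with a percentile-bootstrap CI. This is not Bonett's closed-form interval, which is given in the paper; the data and contrast are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
groups = [rng.normal(m, s, 40) for m, s in [(0.0, 1.0), (0.4, 1.5), (0.9, 2.0)]]
c = np.array([-1.0, 0.0, 1.0])    # contrast: group 3 vs. group 1

def std_contrast(gs):
    means = np.array([g.mean() for g in gs])
    variances = np.array([g.var(ddof=1) for g in gs])
    # Standardize by the square root of the average group variance.
    return (c @ means) / np.sqrt(variances.mean())

boot = [std_contrast([rng.choice(g, g.size, replace=True) for g in groups])
        for _ in range(5000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"estimate = {std_contrast(groups):.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```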

3.
It is difficult to obtain adequate power to test a small effect size under the conventional alpha criterion of 0.05. Most often, an inferential test will fail to reach statistical significance and go unpublished; on the rare occasions when significance is obtained, an exaggerated effect size is calculated and reported. Accepting all inferential probabilities and their associated effect sizes, rather than only the significant ones, could solve this exaggeration problem. Graphs generated through Monte Carlo methods illustrate the point. The first graph presents effect sizes (Cohen's d) as lines from 1 to 0, with probability on the Y axis and the number of measures on the X axis; it shows that effect sizes of .5 or less should yield non-significance at sample sizes below 120 measures. The remaining graphs show results for as many as 10 small-sample replications: as sample size increases, the means converge on the effect size and measurement accuracy emerges.
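A quick Monte Carlo of the same flavor as the graphs described (our own illustrative parameters): how often a two-sample t-test at alpha = .05 reaches significance for d = 0.5 as the per-group sample size grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
d = 0.5
for n_per_group in (10, 20, 40, 60, 120):
    sig = 0
    for _ in range(2000):
        g1 = rng.normal(d, 1, n_per_group)
        g2 = rng.normal(0, 1, n_per_group)
        if stats.ttest_ind(g1, g2).pvalue < 0.05:
            sig += 1
    print(f"n/group = {n_per_group:3d}: power ~ {sig / 2000:.2f}")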

4.
仲晓波, 心理科学 (Psychological Science), 2015, (4): 807-812
Experimental replicability in the strict sense refers to the reproducibility of results when the experiment's control conditions are held constant. Confidence intervals are the appropriate way to express this kind of replicability, and it can be improved by introducing random extraneous variables that influence the dependent variable as covariates in experimental design and data analysis. Replicability in a second sense refers to the transferability of experimental results; it concerns changes in results that arise, when control conditions change, from interactions between the controlled variables and the independent variables. In both senses, the low replicability of psychological experiments stems from the sheer number and heterogeneity of their extraneous variables.
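A hedged illustration of the covariate point (all names and values are made up): adding a random extraneous variable that affects the outcome as a covariate shrinks the error term and narrows the confidence interval on the treatment effect.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100
treat = np.repeat([0.0, 1.0], n // 2)
covar = rng.normal(size=n)                 # a random extraneous variable
y = 0.4 * treat + 0.8 * covar + rng.normal(size=n)

for label, X in [("without covariate", np.column_stack([treat])),
                 ("with covariate   ", np.column_stack([treat, covar]))]:
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    lo, hi = fit.conf_int()[1]             # CI for the treatment effect
    print(f"{label}: 95% CI width = {hi - lo:.2f}")
```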

5.
The authors argue that a robust version of Cohen's effect size constructed by replacing population means with 20% trimmed means and the population standard deviation with the square root of a 20% Winsorized variance is a better measure of population separation than is Cohen's effect size. The authors investigated coverage probability for confidence intervals for the new effect size measure. The confidence intervals were constructed by using the noncentral t distribution and the percentile bootstrap. Over the range of distributions and effect sizes investigated in the study, coverage probability was better for the percentile bootstrap confidence interval.
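A minimal sketch of the robust effect size described above, with a percentile-bootstrap CI. We pool the two groups' Winsorized variances and omit the rescaling constant (.642) sometimes applied in the literature to put the measure on Cohen's d scale, so details here are simplifying assumptions.

```python
import numpy as np
from scipy import stats

def robust_d(x, y, trim=0.2):
    tm_diff = stats.trim_mean(x, trim) - stats.trim_mean(y, trim)
    # Pool the 20% Winsorized variances of the two groups.
    wv = (stats.mstats.winsorize(x, (trim, trim)).var(ddof=1) +
          stats.mstats.winsorize(y, (trim, trim)).var(ddof=1)) / 2
    return tm_diff / np.sqrt(wv)

rng = np.random.default_rng(4)
x, y = rng.normal(0.8, 1, 50), rng.standard_t(df=3, size=50)  # heavy-tailed group
boot = [robust_d(rng.choice(x, 50, True), rng.choice(y, 50, True))
        for _ in range(3000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"robust d = {robust_d(x, y):.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```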

6.
Traditional null hypothesis significance testing does not yield the probability of the null or its alternative and, therefore, cannot logically ground scientific decisions. The decision theory proposed here calculates the expected utility of an effect on the basis of (1) the probability of replicating it and (2) a utility function on its size. It takes significance tests—which place all value on the replicability of an effect and none on its magnitude—as a special case, one in which the cost of a false positive is revealed to be an order of magnitude greater than the value of a true positive. More realistic utility functions credit both replicability and effect size, integrating them for a single index of merit. The analysis incorporates opportunity cost and is consistent with alternate measures of effect size, such as r² and information transmission, and with Bayesian model selection criteria. An alternate formulation is functionally equivalent to the formal theory, transparent, and easy to compute.
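A schematic sketch, not the paper's exact formulation: expected utility as (probability of replication) times (utility of the effect's size), with a cost charged to the false-positive case. The replication probability follows one common approximation, P(same sign); the quadratic utility and cost value are assumptions of ours.

```python
from scipy.stats import norm

def expected_utility(d_obs, n_per_group, cost_false_positive=1.0):
    se_d = (2.0 / n_per_group) ** 0.5            # approx. SE of Cohen's d
    p_rep = norm.cdf(d_obs / (se_d * 2 ** 0.5))  # P(replication has same sign)
    utility = d_obs ** 2                         # value increasing in effect size
    return p_rep * utility - (1 - p_rep) * cost_false_positive

for d in (0.1, 0.3, 0.5, 0.8):
    print(f"d = {d}: expected utility = {expected_utility(d, 30):.3f}")
```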

7.
Replication studies frequently fail to detect genuine effects because too few subjects are employed to yield an acceptable level of power. To remedy this situation, a method of sample size determination in replication attempts is described that uses information supplied by the original experiment to establish a distribution of probable effect sizes. The sample size to be employed is that which supplies an expected power of the desired amount over the distribution of probable effect sizes. The method may be used in replication attempts involving the comparison of means, the comparison of correlation coefficients, and the comparison of proportions. The widely available equation-solving program EUREKA provides a rapid means of executing the method on a microcomputer. Only ten lines are required to represent the method as a set of equations in EUREKA’s language. Such an equation file is readily modified, so that even inexperienced users find it a straightforward means of obtaining the sample size for a variety of designs.
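A minimal sketch of the procedure's core idea under assumed forms (a normal distribution of probable effect sizes; not the authors' exact equations): average power over that distribution and grow n until the expected power meets the target.

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(5)
d_obs, n_orig, target = 0.5, 25, 0.80
se_d = np.sqrt(2 / n_orig)                    # approx. sampling SD of observed d
probable_ds = rng.normal(d_obs, se_d, 1000)   # distribution of probable effects
solver = TTestIndPower()

def expected_power(n):
    return np.mean([solver.power(effect_size=abs(d), nobs1=n, alpha=0.05)
                    for d in probable_ds])

n = 5
while expected_power(n) < target:
    n += 5
print(f"replication n per group for expected power {target:.2f}: ~{n}")
```

Because the power function is averaged over the whole distribution rather than evaluated at the point estimate, the resulting n is typically larger than a naive calculation would give.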

8.
A method of sample-size determination for use in attempts to replicate experiments is described. It is appropriate in situations where there is uncertainty about the magnitude of the effect under investigation. The procedure uses information supplied by the original experiment to establish a distribution of probable effect sizes. The sample size to be used in a replication study is that which provides an expected power of the desired amount over the distribution of probable effect sizes. A FORTRAN 77 program is presented that permits rapid calculation of sample size in replication attempts employing comparisons of means, correlation coefficients, or proportions.

9.
Objectives: We aim to introduce the discussion on the crisis of confidence to sport and exercise psychology. We focus on an important aspect of this debate, the impact of sample sizes, by assessing sample sizes within sport and exercise psychology. Researchers have argued that publications in psychological research contain numerous false-positive findings and inflated effect sizes due to small sample sizes. Method: We analyse the four leading journals in sport and exercise psychology regarding sample sizes of all quantitative studies published in these journals between 2009 and 2013. Subsequently, we conduct power analyses. Results: A substantial proportion of published studies does not have sufficient power to detect effect sizes typical for psychological research. Sample sizes and power vary between research designs. Although many correlational studies have adequate sample sizes, experimental studies are often underpowered to detect small-to-medium effects. Conclusions: As sample sizes are small, research in sport and exercise psychology may suffer from false-positive results and inflated effect sizes, while at the same time failing to detect meaningful small effects. Larger sample sizes are warranted, particularly in experimental studies.
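For context on what "underpowered" means here, standard power arithmetic (not figures from the reviewed journals): the approximate per-group n needed for .80 power in a two-sided two-sample t-test at alpha = .05, using the conventional benchmarks for Cohen's d.

```python
import math
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    n = solver.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"{label} effect (d = {d}): n = {math.ceil(n)} per group")
```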

10.
The probability of “replication,” p_rep, has been proposed as a means of identifying replicable and reliable effects in the psychological sciences. We conduct a basic test of p_rep that reveals that it misestimates the true probability of replication, especially for small effects. We show how these general problems with p_rep play out in practice, when it is applied to predict the replicability of observed effects over a series of experiments. Our results show that, over any plausible series of experiments, the true probabilities of replication will be very different from those predicted by p_rep. We discuss some basic problems in the formulation of p_rep that are responsible for its poor performance, and conclude that p_rep is not a useful statistic for psychological science.
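A sketch of the kind of check described above, under illustrative assumptions: compare p_rep in one common formulation, Φ(d_obs / (√2·σ_d)), with the true probability that a same-n replication finds an effect of the same sign.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
true_d, n = 0.2, 20                               # a small true effect
se_d = np.sqrt(2 / n)
true_p_same_sign = stats.norm.cdf(true_d / se_d)  # replication's P(d_rep > 0)

preps = []
for _ in range(5000):
    g1, g2 = rng.normal(true_d, 1, n), rng.normal(0, 1, n)
    sd = np.sqrt((g1.var(ddof=1) + g2.var(ddof=1)) / 2)
    d_obs = (g1.mean() - g2.mean()) / sd
    preps.append(stats.norm.cdf(d_obs / (se_d * np.sqrt(2))))

print(f"true P(same sign): {true_p_same_sign:.2f}, "
      f"p_rep: mean {np.mean(preps):.2f}, SD {np.std(preps):.2f}")
```

For small effects, p_rep is both biased and highly variable from study to study, which is the miscalibration the authors document.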

11.
We conducted a close replication of the seminal work by Marcus and colleagues from 1999, which showed that after a brief auditory exposure phase, 7-month-old infants were able to learn and generalize a rule to novel syllables not previously present in the exposure phase. This work became the foundation for the theoretical framework by which we assume that infants are able to learn abstract representations and generalize linguistic rules. While some extensions on the original work have shown evidence of rule learning, the outcomes are mixed, and an exact replication of Marcus et al.'s study has thus far not been reported. A recent meta-analysis by Rabagliati and colleagues brings to light that the rule-learning effect depends on stimulus type (e.g., meaningfulness, speech vs. nonspeech) and is not as robust as often assumed. In light of the theoretical importance of the issue at stake, it is appropriate and necessary to assess the replicability and robustness of Marcus et al.'s findings. Here we have undertaken a replication across four labs with a large sample of 7-month-old infants (N = 96), using the same exposure patterns (ABA and ABB), methodology (Headturn Preference Paradigm), and original stimuli. As in the original study, we tested the hypothesis that infants are able to learn abstract “algebraic” rules and apply them to novel input. Our results did not replicate the original findings: infants showed no difference in looking time between test patterns consistent or inconsistent with the familiarization pattern they were exposed to.

12.
In the last few years, the field of psychology has been challenged with a crisis in the rigor and reproducibility of the science. The focus of these issues has primarily been in social, cognitive, and cognitive neuroscience psychology; however, the area of developmental research is not immune to these issues. This paper provides an overview of the “replication crisis” and the choices made by researchers that are often not noted in methods, thus making the replication of studies more difficult. In this review we discuss researcher flexibility at the study design, sample size selection, data collection, and data analysis stages of research. In each of these areas we address examples of bias and how developmental researchers can address these issues in their own research.

13.
In conventional frequentist power analysis, one often uses an effect size estimate, treats it as if it were the true value, and ignores uncertainty in the effect size estimate for the analysis. The resulting sample sizes can vary dramatically depending on the chosen effect size value. To resolve the problem, we propose a hybrid Bayesian power analysis procedure that models uncertainty in the effect size estimates from a meta-analysis. We use observed effect sizes and prior distributions to obtain the posterior distribution of the effect size and model parameters. Then, we simulate effect sizes from the obtained posterior distribution. For each simulated effect size, we obtain a power value. With an estimated power distribution for a given sample size, we can estimate the probability of reaching a power level or higher and the expected power. With a range of planned sample sizes, we can generate a power assurance curve. Both the conventional frequentist and our Bayesian procedures were applied to conduct prospective power analyses for two meta-analysis examples (testing standardized mean differences in example 1 and Pearson's correlations in example 2). The advantages of our proposed procedure are demonstrated and discussed.
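A minimal sketch of the hybrid idea under strong simplifications (toy meta-analytic data, a fixed-effect normal posterior with a flat prior; the paper also models heterogeneity): simulate effect sizes from the posterior, compute power for each at a planned n, and read off expected power and assurance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Toy meta-analysis: observed standardized mean differences and their SEs.
d_i  = np.array([0.42, 0.31, 0.55, 0.18, 0.47])
se_i = np.array([0.15, 0.20, 0.18, 0.22, 0.16])

# Normal likelihood + flat prior => approx. normal posterior for the mean effect.
w = 1 / se_i**2
post_mean, post_sd = np.sum(w * d_i) / np.sum(w), np.sqrt(1 / np.sum(w))
sim_ds = rng.normal(post_mean, post_sd, 5000)

def power_two_sample(d, n, alpha=0.05):
    # Power of a two-sided two-sample t-test with n subjects per group.
    df, nc = 2 * n - 2, d * np.sqrt(n / 2)
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(tcrit, df, nc) + stats.nct.cdf(-tcrit, df, nc)

for n in (40, 80, 120):
    p = power_two_sample(sim_ds, n)
    print(f"n/group = {n}: expected power {p.mean():.2f}, "
          f"assurance P(power >= .80) = {(p >= 0.80).mean():.2f}")
```

Sweeping n and plotting the assurance values would give the power assurance curve the abstract mentions.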

14.
The equality of two group variances is frequently tested in experiments. However, criticisms of null hypothesis statistical testing on means have recently arisen and there is interest in other types of statistical tests of hypotheses, such as superiority/non-inferiority and equivalence. Although these tests have become more common in psychology and social sciences, the corresponding sample size estimation for these tests is rarely discussed, especially when the sampling unit costs are unequal or group sizes are unequal for two groups. Thus, for finding optimal sample size, the present study derived an initial allocation by approximating the percentiles of an F distribution with the percentiles of the standard normal distribution and used the exhaustion algorithm to select the best combination of group sizes, thereby ensuring the resulting power reaches the designated level and is maximal with a minimal total cost. In this manner, optimization of sample size planning is achieved. The proposed sample size determination has a wide range of applications and is efficient in terms of Type I errors and statistical power in simulations. Finally, an illustrative example from a report by the Health Survey for England, 1995–1997, is presented using hypertension data. For ease of application, four R Shiny apps are provided and benchmarks for setting equivalence margins are suggested.
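An illustrative simplification, not the paper's equivalence procedure: an exhaustive search for the cheapest (n1, n2) giving .80 power for a one-sided F test that one variance exceeds the other, with unequal per-subject costs. The variance ratio, costs, and search bounds are all assumptions of ours.

```python
from scipy import stats

alpha, target, rho = 0.05, 0.80, 2.0   # true variance ratio sigma1^2 / sigma2^2
c1, c2 = 3.0, 1.0                      # cost per subject: group 1 vs. group 2

def power(n1, n2):
    # Reject when S1^2/S2^2 > F crit; under H1 that ratio is rho * F(n1-1, n2-1).
    fcrit = stats.f.ppf(1 - alpha, n1 - 1, n2 - 1)
    return stats.f.sf(fcrit / rho, n1 - 1, n2 - 1)

best = None
for n1 in range(5, 200):
    for n2 in range(5, 200):
        if power(n1, n2) >= target:
            cost = c1 * n1 + c2 * n2
            if best is None or cost < best[0]:
                best = (cost, n1, n2)
            break  # at this n1, any larger n2 only adds cost

print(f"optimal: n1 = {best[1]}, n2 = {best[2]}, total cost = {best[0]:.0f}, "
      f"power = {power(best[1], best[2]):.3f}")
```

Note how the cheaper group absorbs more subjects: the exhaustive search trades the two sample sizes off against each other under the cost constraint, which is the core of the paper's allocation idea.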

15.
Some have proposed that the null hypothesis significance test, as usually conducted using the t test of the difference between means, is an impediment to progress in psychology. To improve its prospects, using Neyman-Pearson confidence intervals and Cohen's standardized effect sizes, d, is recommended. The purpose of these approaches is to enable us to understand what can appropriately be said about the distances between the means and their reliability. Others have written extensively that these recommended strategies are highly interrelated and use identical information. This essay was written to remind us that the t test, based on the sample—not the true—standard deviation, does not apply solely to distance between means. The t test pertains to a much more ambiguous specification: the difference between samples, including sampling variations of the standard deviation.

16.
The accuracy of science depends on the precision of its methods. When fields produce precise measurements, the scientific method can generate remarkable gains in knowledge. When fields produce noisy measurements, however, the scientific method is not guaranteed to work – in fact, noisy measurements are now regarded as a leading cause of the replication crisis in psychology. Scientists should therefore strive to improve the precision of their methods, especially in fields with noisy measurements. Here, we show that automation can reduce measurement error by ∼60% in one domain of developmental psychology: controlled-rearing studies of newborn chicks. Automated studies produce measurements that are 3–4 times more precise than non-automated studies and produce effect sizes that are 3–4 times larger than non-automated studies. Automation also eliminates experimenter bias and allows replications to be performed quickly and easily. We suggest that automation can be a powerful tool for improving measurement precision, producing high powered experiments, and combating the replication crisis.

17.
This study presents a data set for a reference group on the Reitan-Indiana Neuropsychological Test Battery for Young Children. The data set is based on a sample of 224 children, ages 5 to 8 years, referred to a special services cooperative for academic or behavioral concerns during the years 1980 through 1993. Data are presented in terms of sample size, means, standard deviations, diagnostic classifications, and population characteristics. Previously published data sets are reviewed in comparison to this newly acquired data set. Potential advantages of this data set include the larger sample, contemporary data collection, and a sample drawn from a United States school-referred population.

18.
The paper takes up the problem of performing all pairwise comparisons among J independent groups based on 20% trimmed means. Currently, a method that stands out is the percentile-t bootstrap method where the bootstrap is used to estimate the quantiles of a Studentized maximum modulus distribution when all pairs of population trimmed means are equal. However, a concern is that in simulations, the actual probability of one or more Type I errors can drop well below the nominal level when sample sizes are small. A practical issue is whether a method can be found that corrects this problem while maintaining the positive features of the percentile-t bootstrap. Three new methods are considered here, one of which achieves the desired goal. Another method, which takes advantage of theoretical results by Singh (1998), performs almost as well but is not recommended when the smallest sample size drops below 15. In some situations, however, it gives substantially shorter confidence intervals.
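A sketch of the baseline percentile-t bootstrap under simplifying assumptions (Yuen-style standard errors for trimmed means; the paper's small-sample corrections are omitted): bootstrap the maximum absolute t across all pairs to get a familywise critical value.

```python
import itertools
import numpy as np
from scipy import stats

def yuen_se(x, trim=0.2):
    # Yuen's standard error of a trimmed mean via the Winsorized variance.
    n, g = len(x), int(trim * len(x))
    wv = stats.mstats.winsorize(x, (trim, trim)).var(ddof=1)
    h = n - 2 * g                      # effective sample size after trimming
    return np.sqrt((n - 1) * wv / (h * (h - 1)))

rng = np.random.default_rng(8)
groups = [rng.normal(m, 1, 30) for m in (0.0, 0.0, 0.6)]
pairs = list(itertools.combinations(range(len(groups)), 2))

def t_stats(gs, centers):
    return [abs((stats.trim_mean(gs[i], 0.2) - stats.trim_mean(gs[j], 0.2)
                 - (centers[i] - centers[j]))
                / np.hypot(yuen_se(gs[i]), yuen_se(gs[j]))) for i, j in pairs]

tm = [stats.trim_mean(g, 0.2) for g in groups]
# Bootstrap t's are centered at the sample trimmed means, then max'ed over pairs.
max_t = [max(t_stats([rng.choice(g, g.size, True) for g in groups], tm))
         for _ in range(2000)]
crit = np.quantile(max_t, 0.95)        # bootstrap familywise critical value
for (i, j), t in zip(pairs, t_stats(groups, [0.0] * len(groups))):
    print(f"groups {i} vs {j}: |t| = {t:.2f}, reject = {t > crit}")
```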

19.
In this paper, the performance of six types of techniques for comparisons of means is examined. These six emerge from the distinction between the method employed (hypothesis testing, model selection using information criteria, or Bayesian model selection) and the set of hypotheses that is investigated (a classical, exploration-based set of hypotheses containing equality constraints on the means, or a theory-based limited set of hypotheses with equality and/or order restrictions). A simulation study is conducted to examine the performance of these techniques. We demonstrate that, if one has specific, a priori specified hypotheses, confirmation (i.e., investigating theory-based hypotheses) has advantages over exploration (i.e., examining all possible equality-constrained hypotheses). Furthermore, examining reasonable order-restricted hypotheses has more power to detect the true effect/non-null hypothesis than evaluating only equality restrictions. Additionally, when investigating more than one theory-based hypothesis, model selection is preferred over hypothesis testing. Because of the first two results, we further examine the techniques that are able to evaluate order restrictions in a confirmatory fashion by examining their performance when the homogeneity of variance assumption is violated. Results show that the techniques are robust to heterogeneity when the sample sizes are equal. When the sample sizes are unequal, the performance is affected by heterogeneity. The size and direction of the deviations from the baseline, where there is no heterogeneity, depend on the effect size (of the means) and on the trend in the group variances with respect to the ordering of the group sizes. Importantly, the deviations are less pronounced when the group variances and sizes exhibit the same trend (e.g., are both increasing with group number).
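A toy sketch of one ingredient of the Bayesian confirmatory approach (a crude illustration of ours, not the paper's procedure): estimate the posterior probability of an order-restricted hypothesis, mu1 < mu2 < mu3, by sampling from approximate normal posteriors of the group means, and compare it with the 1/6 chance a random ordering would satisfy the constraint.

```python
import numpy as np

rng = np.random.default_rng(9)
groups = [rng.normal(m, 1.0, 25) for m in (0.0, 0.3, 0.6)]

# Approximate normal posterior for each mean: N(sample mean, SE^2).
draws = np.column_stack([
    rng.normal(g.mean(), g.std(ddof=1) / np.sqrt(g.size), 20000) for g in groups
])
p_order = np.mean((draws[:, 0] < draws[:, 1]) & (draws[:, 1] < draws[:, 2]))
print(f"P(mu1 < mu2 < mu3 | data) ~ {p_order:.2f} (vs. 1/6 ~ 0.17 by chance)")
```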

20.