Similar literature
1.
Users of interobserver agreement statistics have heretofore ignored the problem of autocorrelation in behavior sequences when testing the statistical significance of agreement measures. Because of autocorrelation, traditional reliability tests based on the 2 × 2 contingency-table model (e.g., kappa, phi) are incorrect. Correct tests can be developed by using the bivariate time series as a statistical model. Seen from this perspective, testing the significance of interobserver agreement becomes formally equivalent to testing the significance of the lag-zero cross-correlation between two time series. The robust procedure known as the jackknife is suggested for this purpose.
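The jackknife idea in the abstract above can be sketched as follows. This is a minimal illustration and not the article's procedure: the function names, the choice to delete contiguous blocks of intervals (rather than single observations), and the number of blocks are all assumptions made here for the example. It computes the lag-zero correlation between two observers' interval records, then forms jackknife pseudo-values to get an estimate and a standard error.

```python
import math


def pearson(x, y):
    """Lag-zero (ordinary Pearson) correlation between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def jackknife_correlation(x, y, n_blocks):
    """Delete-one-block jackknife estimate and standard error (hypothetical helper).

    x and y are lists of equal length; each block must leave both
    remaining series with some variation, or pearson() divides by zero.
    """
    full = pearson(x, y)
    size = len(x) // n_blocks
    pseudo_values = []
    for b in range(n_blocks):
        start = b * size
        # The last block absorbs any leftover observations.
        stop = len(x) if b == n_blocks - 1 else start + size
        r_without = pearson(x[:start] + x[stop:], y[:start] + y[stop:])
        pseudo_values.append(n_blocks * full - (n_blocks - 1) * r_without)
    estimate = sum(pseudo_values) / n_blocks
    variance = sum((p - estimate) ** 2 for p in pseudo_values) / (
        n_blocks * (n_blocks - 1)
    )
    return estimate, math.sqrt(variance)
```

Dividing the estimate by its standard error gives an approximate test statistic. Deleting blocks rather than single intervals is one common way to respect serial dependence, but the block length is a judgment call.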

2.
Factors affecting interobserver agreement (reliability) with a comprehensive coding system in the naturalistic observation of children were examined. Data from 117 pairs of observations on 35 children and their families were examined with respect to reliability and three possible covariates: response frequency, observation complexity, and code definition clarity. Analysis of results strongly supported response frequency as a positive covariate of interobserver agreement. Complexity was found to covary negatively with interobserver agreement. The relationship between code clarity and reliability was in the predicted direction but failed to reach statistical significance. Implications for observer training and data collection in observational studies are discussed. An earlier version of this article was prepared for presentation at The Association for the Advancement of Behavior Therapy, Atlanta, Georgia, December 1977.

3.
4.
Percentage agreement measures of interobserver agreement or "reliability" have traditionally been used to summarize observer agreement from studies using interval recording, time-sampling, and trial-scoring data collection procedures. Recent articles disagree on whether to continue using these percentage agreement measures, on which ones to use, and on what to do about chance agreements if their use is continued. Much of the disagreement derives from the need to be reasonably certain we do not accept as evidence of true interobserver agreement those agreement levels which are substantially probable as a result of chance observer agreement. The various percentage agreement measures are shown to be adequate to this task, but easier ways are discussed. Tables are given to permit checking whether obtained disagreements are unlikely to be due to chance. Particularly important is the discovery of a simple rule that, when met, makes the tables unnecessary. If reliability checks using 50 or more observation occasions produce 10% or fewer disagreements, for behavior rates from 10% through 90%, the agreement achieved is quite unlikely to be the result of chance agreement.

5.
Experiments are considered where each of a sample of subjects is assigned to one of C categories separately by each of a fixed or varying group of observers. Building on earlier publications, general procedures are proposed to analyze agreements and disagreements among observers. In the case of a varying group of observers, it is shown that it is not necessary to demand a constant number of observers per subject. In the case of a fixed group of observers, the problem of missing data is considered. The procedures are illustrated within the context of two clinical diagnosis examples. In the first example it is investigated which categories are relatively hard to distinguish from one another; a new theorem is applied that shows a useful property of the statistic kappa. In the second example it is investigated if a subgroup of observers can be found with a significantly higher degree of interobserver agreement. The author gratefully acknowledges the valuable suggestions by W. Molenaar, R. van Strik, R. Popping and the referees.

6.
We demonstrate some procedures in the statistical computing environment R for obtaining maximum likelihood estimates of the parameters of a psychometric function by fitting a generalized nonlinear regression model to the data. A feature for fitting a linear model to the threshold (or other) parameters of several psychometric functions simultaneously provides a powerful tool for testing hypotheses about the data and, potentially, for reducing the number of parameters necessary to describe them. Finally, we illustrate procedures for treating one parameter as a random effect that would permit a simplified approach to modeling stimulus-independent variability due to factors such as lapses or interobserver differences. These tools will facilitate a more comprehensive and explicit approach to the modeling of psychometric data.

7.
We examined articles with experiments published in the Journal of Applied Behavior Analysis and in Behavior Analysis in Practice from 2017 through 2021 to determine how frequently procedural fidelity was assessed. When procedural fidelity was assessed, we determined how often a measure of interobserver agreement for those fidelity data was provided. We also determined how often a measure of interobserver agreement for participants' behavior was provided. Across both journals and all years, 54.7% of relevant articles provided a measure of procedural fidelity. Of them, 17.7% provided a measure of interobserver agreement for procedural fidelity. In marked contrast, 96.4% provided interobserver agreement data for participants' behavior. It is unfortunate that applied behavior analysts frequently fail to provide procedural fidelity data and, when they do, often fail to provide interobserver agreement data for the fidelity data. Reviewers for, and editors of, behavior-analytic journals are encouraged to strongly consider the relative value of procedural fidelity and agreement on procedural fidelity measures when rendering recommendations on the suitability of a given submission.

8.
The purpose of this article is to outline the development of the Multiple Option Observation System for Experimental Studies (MOOSES), a flexible data collection package for applied behavioral research. Several data collection options are available to users of MOOSES: event-based recording, interaction-based recording, duration recording, and interval recording can be used individually or together, depending upon the research question. The collection program can incorporate any of the keys on the keyboard. Function keys on the top or side are used for toggle (duration states) type data collection. Types of analysis include frequency and duration of discrete events, frequency of general behavior states, frequency and duration of events within behavioral states, percent interval analysis, sequential analysis, and interobserver agreement. Data obtained from MOOSES are easily incorporated with other data for further statistical analysis with standard statistical packages or popular spreadsheet programs. Applications of MOOSES and its uses in social interaction research are presented. Comparisons with other similar systems are provided.

9.
Some of the contradictions in psychological research may be attributable to failure to distinguish statistical from clinical significance. Eighty-two articles in which the MMPI was the research instrument were analyzed to see how often the results reported as significant were in fact large enough to warrant such a conclusion. Articles were classified as to whether or not the clinical interpretations were consistent with the statistical results. Excluding articles in which data were insufficient to reach an independent conclusion, 54.90% of the articles presented conclusions of clinical significance that were not supported by the data, while 45.10% reported clinical results that were supported by the data.

10.
Portable electronic data collection devices permit investigators to collect large amounts of observational data in a form ready for computer analysis. These devices are particularly efficient for gathering continuous data on multiple behavior categories. We expect that the increasing availability of these devices will lead to greater use of continuous data collection methods in observational research. This paper addresses the difficulties encountered when calculating traditional interobserver agreement statistics for continuous, multiple-code scoring. Two alternative strategies are described that yield interobserver agreement values based on the exact time of behavior code entries by the primary and secondary observers. Work on this paper was supported in part by NICHD Grants P01HD15051 and R01HD17650 and Office of Special Education and Rehabilitation Services Grant G008302980.

11.
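A comparative study of the significance t test for differences and meta-analysis (差数显著性t检验与元分析的对比研究)   Total citations: 5, self-citations: 0, citations by others: 5
郭春彦 (Guo Chunyan), 朱滢 (Zhu Ying) 《心理学报》 (Acta Psychologica Sinica) 1997,30(4):436-442
A computer was used to construct subject populations and to simulate experimental procedures for a sampling study, in order to examine how the significance t test and meta-analysis differ when testing experimental results. In the simulated experiments, the t test was affected by the significance level, the sample size, and the population effect size, which ultimately affected the reliability of the statistical inference. It is therefore recommended that statistical power be estimated whenever a significance test is carried out. Meta-analysis treats samples as the elements from which the population is inferred and therefore has high accuracy and reliability; it is very likely to become an important statistical tool in future psychological research.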

12.
Chow SL 《The Behavioral and brain sciences》1998,21(2):169-94; discussion 194-239
The null-hypothesis significance-test procedure (NHSTP) is defended in the context of the theory-corroboration experiment, as well as the following contrasts: (a) substantive hypotheses versus statistical hypotheses, (b) theory corroboration versus statistical hypothesis testing, (c) theoretical inference versus statistical decision, (d) experiments versus nonexperimental studies, and (e) theory corroboration versus treatment assessment. The null hypothesis can be true because it is the hypothesis that errors are randomly distributed in data. Moreover, the null hypothesis is never used as a categorical proposition. Statistical significance means only that chance influences can be excluded as an explanation of data; it does not identify the nonchance factor responsible. The experimental conclusion is drawn with the inductive principle underlying the experimental design. A chain of deductive arguments gives rise to the theoretical conclusion via the experimental conclusion. The anomalous relationship between statistical significance and the effect size often used to criticize NHSTP is more apparent than real. The absolute size of the effect is not an index of evidential support for the substantive hypothesis. Nor is the effect size, by itself, informative as to the practical importance of the research result. Being a conditional probability, statistical power cannot be the a priori probability of statistical significance. The validity of statistical power is debatable because statistical significance is determined with a single sampling distribution of the test statistic based on H0, whereas it takes two distributions to represent statistical power or effect size. Sample size should not be determined in the mechanical manner envisaged in power analysis. It is inappropriate to criticize NHSTP for nonstatistical reasons. At the same time, neither effect size, nor confidence interval estimate, nor posterior probability can be used to exclude chance as an explanation of data. Neither can any of them fulfill the nonstatistical functions expected of them by critics.

13.
Previous recommendations to employ occurrence, nonoccurrence, and overall estimates of interobserver reliability for interval data are reviewed. A rationale for comparing obtained reliability to reliability that would result from a random-chance model is explained. Formulae and graphic functions are presented to allow for the determination of chance agreement for each of the three indices, given any obtained per cent of intervals in which a response is recorded to occur. All indices are interpretable throughout the range of possible obtained values for the per cent of intervals in which a response is recorded. The level of chance agreement simply changes with changing values. Statistical procedures that could be used to determine whether obtained reliability is significantly superior to chance reliability are reviewed. These procedures are rejected because they yield significance levels that are partly a function of sample sizes and because there are no general rules to govern acceptable significance levels depending on the sizes of samples employed.
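One way to see how chance agreement shifts with the recorded rate is the sketch below. It is not taken from the article: it assumes a simple random-chance model in which both observers score the response independently in the same proportion `p` of intervals, and it uses the common definitions of the three indices (overall agreement over all intervals, occurrence agreement over intervals where at least one observer scored the response, nonoccurrence agreement over intervals where at least one did not). The function name is hypothetical.

```python
def chance_agreement(p):
    """Expected agreement indices when two observers score independently.

    p is the proportion of intervals in which each observer records the
    response (assumed equal for both observers in this sketch).
    """
    both = p * p                      # both score an occurrence
    neither = (1 - p) * (1 - p)       # both score a nonoccurrence
    disagree = 1 - both - neither     # exactly one scores an occurrence

    overall = both + neither
    # Occurrence and nonoccurrence indices are undefined when their
    # denominators are zero (p == 0 or p == 1 respectively).
    occurrence = both / (both + disagree) if p > 0 else float("nan")
    nonoccurrence = neither / (neither + disagree) if p < 1 else float("nan")
    return {
        "overall": overall,
        "occurrence": occurrence,
        "nonoccurrence": nonoccurrence,
    }
```

Under these assumptions, a response recorded in 10% of intervals gives chance overall agreement of 0.82 but chance occurrence agreement of only about 0.05, which is why the same obtained percentage means different things for different indices.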

14.
Estimates of observer agreement are necessary to assess the acceptability of interval data. A common method for assessing observer agreement, per cent agreement, has several major weaknesses and varies as a function of the frequency of behavior recorded and the inclusion or exclusion of agreements on nonoccurrences. Also, agreements that might be expected to occur by chance are not taken into account. An alternative method for assessing observer agreement that determines the exact probability that the obtained number of agreements or better would have occurred by chance is presented and explained. Agreements on both occurrences and nonoccurrences of behavior are considered in the calculation of this probability.
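The abstract does not give its formula, so the following is only an illustrative sketch of one exact-probability calculation, under an assumption made here: each observer's total number of scored occurrences is treated as fixed, and the placement of those occurrences across intervals is random. The number of intervals both observers score as occurrences then follows a hypergeometric distribution, and because total agreements rise with that count, the upper tail gives the probability of the obtained agreement or better. The function and parameter names are hypothetical.

```python
from math import comb


def prob_agreement_or_better(n_intervals, a_occurrences, b_occurrences, joint_occurrences):
    """P(joint occurrences >= observed) under random placement.

    n_intervals       -- total number of observation intervals
    a_occurrences     -- intervals observer A scored as occurrences
    b_occurrences     -- intervals observer B scored as occurrences
    joint_occurrences -- intervals both scored as occurrences (observed)

    Total agreements equal n - a - b + 2 * joint, so "this many joint
    occurrences or more" is the same event as "this many agreements or more".
    """
    total_arrangements = comb(n_intervals, b_occurrences)
    max_joint = min(a_occurrences, b_occurrences)
    favorable = sum(
        comb(a_occurrences, k) * comb(n_intervals - a_occurrences, b_occurrences - k)
        for k in range(joint_occurrences, max_joint + 1)
    )
    return favorable / total_arrangements
```

For example, with 4 intervals, 2 occurrences scored by each observer, and both occurrences shared, the call `prob_agreement_or_better(4, 2, 2, 2)` evaluates 1 arrangement out of 6.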

15.
Although multidimensional scaling (MDS) profile analysis is widely used to study individual differences, there is no objective way to evaluate the statistical significance of the estimated scale values. In the present study, a resampling technique (bootstrapping) was used to construct confidence limits for scale values estimated from MDS profile analysis. These bootstrap confidence limits were used, in turn, to evaluate the significance of marker variables of the profiles. The results from analyses of both simulation data and real data suggest that the bootstrap method may be valid and may be used to evaluate hypotheses about the statistical significance of marker variables of MDS profiles.
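The resampling step can be sketched generically. The example below is a percentile bootstrap for an arbitrary statistic and is not the study's MDS procedure: the function name, the default of 2000 resamples, and the use of the percentile method are assumptions for illustration. In the study's setting the statistic would be a scale value re-estimated from each resampled data set.

```python
import random
import statistics


def bootstrap_percentile_ci(data, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence limits for statistic(data).

    Observations are resampled with replacement, the statistic is
    recomputed on each resample, and the limits are read off the
    sorted replicates. A fixed seed keeps the result reproducible.
    """
    rng = random.Random(seed)
    n = len(data)
    replicates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower_index = round((alpha / 2) * (n_resamples - 1))
    upper_index = round((1 - alpha / 2) * (n_resamples - 1))
    return replicates[lower_index], replicates[upper_index]


# Usage: 95% limits for the mean of a small sample.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 2.5]
lower, upper = bootstrap_percentile_ci(sample, statistics.mean)
```

A marker variable would then be judged significant if its confidence limits exclude zero, which matches the use of the limits described in the abstract.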

16.
David Klahr 《Psychometrika》1969,34(3):319-330
Recent advances in computer based psychometric techniques have yielded a collection of powerful tools for analyzing nonmetric data. These tools, although particularly well suited to the behavioral sciences, have several potential pitfalls. Among other things, there is no statistical test for evaluating the significance of the results. This paper provides estimates of the statistical significance of results yielded by Kruskal's nonmetric multidimensional scaling. The estimates, obtained from attempts to scale many randomly generated sets of data, reveal the relative frequency with which apparent structure is erroneously found in unstructured data. For a small number of points (i.e., six or seven) it is very likely that a good fit will be obtained in two or more dimensions when in fact the data are generated by a random process. The estimates presented here can be used as a bench mark against which to evaluate the significance of the results obtained from empirically based nonmetric multidimensional scaling. A preliminary version of this paper was presented at the International Federation for Information Processing Congress 68 in Edinburgh, Scotland, August 5–10, 1968.

17.
The present study was designed to evaluate the outcomes of a day treatment program for 55 eating disordered (ED) patients using clinical and statistical significance testing. Results indicated a statistically significant reduction on all eating disordered outcomes. With respect to clinical significance testing, analysis of these data indicated that the majority of the individuals in the day treatment program made clinically significant and reliable change by the termination of treatment on all eating disorder measures. However, considerably fewer patients improved to such a point that they were asymptomatic. The importance of combining clinical significance testing with traditional significance testing is discussed.

18.
Identifying true statistical dependencies in visual-scanning data involves showing that the observed scanning pattern is significantly more ordered than that which would be produced by a stratified random-sampling model. In the past, entropy has been used as the index to measure statistical order or dependency. Due to the unknown nature of the underlying sampling distributions of entropy, however, researchers have had to use relatively less powerful nonparametric statistical tests to determine significance. In this paper we present relevant portions of the family of sampling distributions of entropy and show that they are sufficiently normally distributed to allow the use of a more powerful parametric statistical test when attempting to distinguish among the different models of sampling.

19.
Discounting is the process by which outcomes lose value. Much of discounting research has focused on differences in the degree of discounting across various groups. This research has relied heavily on conventional null hypothesis significance tests that are familiar to psychologists, such as t‐tests and ANOVAs. As discounting research questions have become more complex by simultaneously focusing on within‐subject and between‐group differences, conventional statistical testing is often not appropriate for the obtained data. Generalized estimating equations (GEE) are one type of mixed‐effects model that are designed to handle autocorrelated data, such as within‐subject repeated‐measures data, and are therefore more appropriate for discounting data. To determine if GEE provides results similar to those of conventional statistical tests, we compared the techniques across 2,000 simulated data sets. The data sets were created using a Monte Carlo method based on an existing data set. Across the simulated data sets, the GEE and the conventional statistical tests generally provided similar patterns of results. As the GEE and more conventional statistical tests provide the same pattern of results, we suggest researchers use the GEE because it was designed to handle data that have the structure that is typical of discounting data.

20.
In a sample of 425 subjects, pure-tone hearing thresholds between the right and left ears were shown to have an average correlation of .885 (or .783 with age partialed out). This high interaural correlation is shown to invalidate the experimental procedure of entering data on the basis of "ears," where each subject can contribute one or two audiograms to the data pool, since such aggregation is demonstrated to produce spuriously high levels of apparent statistical significance in inferential statistical tests.
