Similar literature
20 related records found
1.
The point estimate of sample coefficient alpha may provide a misleading impression of the reliability of the test score. Because sample coefficient alpha is consistently biased downward, it is more likely to yield a misleading impression of poor reliability. The magnitude of the bias is greatest precisely when the variability of sample alpha is greatest (small population reliability and small sample size). Taking into account the variability of sample alpha with an interval estimator may lead to retaining reliable tests that would be otherwise rejected. Here, the authors performed simulation studies to investigate the behavior of asymptotically distribution-free (ADF) versus normal-theory interval estimators of coefficient alpha under varied conditions. Normal-theory intervals were found to be less accurate when item skewness >1 or excess kurtosis >1. For sample sizes over 100 observations, ADF intervals are preferable, regardless of item skewness and kurtosis. A formula for computing ADF confidence intervals for coefficient alpha for tests of any size is provided, along with its implementation as an SAS macro.
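
A minimal Python sketch of the general idea, interval estimation of coefficient alpha, using a nonparametric percentile bootstrap rather than the ADF formula or the SAS macro described in the abstract; the function names, the simulated five-item data, and the number of bootstrap draws are illustrative assumptions.

```python
import numpy as np

def coefficient_alpha(scores):
    """Sample coefficient alpha for an n-persons x k-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def bootstrap_alpha_ci(scores, level=0.95, n_boot=2000, seed=1):
    """Percentile-bootstrap CI for alpha; persons are resampled with replacement."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    boot = np.empty(n_boot)
    for b in range(n_boot):
        boot[b] = coefficient_alpha(scores[rng.integers(0, n, size=n)])
    return tuple(np.percentile(boot, [100 * (1 - level) / 2, 100 * (1 + level) / 2]))

# Illustration with simulated data: 5 tau-equivalent items, n = 150.
rng = np.random.default_rng(0)
true_score = rng.normal(size=(150, 1))
items = true_score + rng.normal(scale=1.2, size=(150, 5))
print(coefficient_alpha(items), bootstrap_alpha_ci(items))
```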

2.
This paper evaluated multilevel reliability measures in two-level nested designs (e.g., students nested within teachers) within an item response theory framework. A simulation study was implemented to investigate the behavior of the multilevel reliability measures and the uncertainty associated with the measures in various multilevel designs regarding the number of clusters, cluster sizes, and intraclass correlations (ICCs), and in different test lengths, for two parameterizations of multilevel item response models with separate item discriminations or the same item discrimination over levels. Marginal maximum likelihood estimation (MMLE)-multiple imputation and Bayesian analysis were employed to evaluate the accuracy of the multilevel reliability measures and the empirical coverage rates of Monte Carlo (MC) confidence or credible intervals. Considering the accuracy of the multilevel reliability measures and the empirical coverage rate of the intervals, the results lead us to generally recommend MMLE-multiple imputation. In the model with separate item discriminations over levels, marginally acceptable accuracy of the multilevel reliability measures and empirical coverage rate of the MC confidence intervals were found only under a single limited condition (200 clusters, a cluster size of 30, an ICC of .2, and 40 items) under MMLE-multiple imputation. In the model with the same item discrimination over levels, the accuracy of the multilevel reliability measures and the empirical coverage rate of the MC confidence intervals were acceptable in all multilevel designs we considered with 40 items under MMLE-multiple imputation. We discuss these findings and provide guidelines for reporting multilevel reliability measures.

3.
Composite measures play an important role in psychology and related disciplines. Composite measures almost always have error. Correspondingly, it is important to understand the reliability of the scores from any particular composite measure. However, the point estimates of the reliability of composite measures are fallible and thus all such point estimates should be accompanied by a confidence interval. When confidence intervals are wide, there is much uncertainty in the population value of the reliability coefficient. Given the importance of reporting confidence intervals for estimates of reliability, coupled with the undesirability of wide confidence intervals, we develop methods that allow researchers to plan sample size in order to obtain narrow confidence intervals for population reliability coefficients. We first discuss composite reliability coefficients and then provide a discussion on confidence interval formation for the corresponding population value. Using the accuracy in parameter estimation approach, we develop two methods to obtain accurate estimates of reliability by planning sample size. The first method provides a way to plan sample size so that the expected confidence interval width for the population reliability coefficient is sufficiently narrow. The second method ensures that the confidence interval width will be sufficiently narrow with some desired degree of assurance (e.g., 99% assurance that the 95% confidence interval for the population reliability coefficient will be less than W units wide). The effectiveness of our methods was verified with Monte Carlo simulation studies. We demonstrate how to easily implement the methods with easy-to-use and freely available software.
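
The planning logic described above can be mimicked in brute-force form: pick a target width W, estimate the expected CI width at a candidate sample size by simulation, and increase n until the target is met. The sketch below does this under an assumed one-factor population model with a percentile-bootstrap interval for alpha; the model, the loading of .6, the grid of sample sizes, and all function names are illustrative assumptions, not the authors' analytic AIPE procedure or their software.

```python
import numpy as np

def alpha(x):
    k = x.shape[1]
    return k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

def expected_ci_width(n, n_items=8, loading=0.6, reps=200, n_boot=300, seed=0):
    """Mean percentile-bootstrap 95% CI width for alpha at sample size n,
    under an assumed one-factor population model (loading, n_items, reps and
    n_boot are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    widths = np.empty(reps)
    for r in range(reps):
        f = rng.normal(size=(n, 1))
        x = loading * f + np.sqrt(1 - loading**2) * rng.normal(size=(n, n_items))
        boot = np.empty(n_boot)
        for b in range(n_boot):
            boot[b] = alpha(x[rng.integers(0, n, size=n)])
        lo, hi = np.percentile(boot, [2.5, 97.5])
        widths[r] = hi - lo
    return widths.mean()

# Crude planning loop: smallest n (on a coarse grid) whose expected width is below W.
W = 0.10
for n in range(50, 501, 50):
    if expected_ci_width(n) < W:
        print("planned sample size:", n)
        break
```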

4.
The common way to calculate confidence intervals for item response theory models is to assume that the standardized maximum likelihood estimator for the person parameter θ is normally distributed. However, this approximation is often inadequate for short and medium test lengths. As a result, the coverage probabilities fall below the nominal confidence level in many cases; and, therefore, the corresponding intervals are no longer confidence intervals in terms of the actual definition. In the present work, confidence intervals are defined more precisely by utilizing the relationship between confidence intervals and hypothesis testing. Two approaches to confidence interval construction are explored that are optimal with respect to criteria of smallness and consistency with the standard approach.

5.
Reliability is one of the most important aspects of testing in educational and psychological measurement. The construction of confidence intervals for reliability coefficients has important implications for evaluating the accuracy of the sample estimate of reliability and for comparing different tests, scoring rubrics, or training procedures for raters or observers. The present simulation study evaluated and compared various parametric and non-parametric methods for constructing confidence intervals of coefficient alpha. Six factors were manipulated: number of items, number of subjects, population coefficient alpha, deviation from the essentially parallel condition, item response distribution, and item response type. The coverage and width of different confidence intervals were compared across simulation conditions.
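
For context, the sketch below checks the empirical coverage of one common parametric interval for alpha, a Feldt-type interval that assumes essentially parallel, normally distributed items, under a single simulated condition. The data-generating model, the condition (n = 100, 10 items, loading .5), and the function names are illustrative assumptions and need not match the conditions or the full set of methods compared in the study.

```python
import numpy as np
from scipy.stats import f as f_dist

def alpha(x):
    k = x.shape[1]
    return k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

def feldt_ci(x, level=0.95):
    """Normal-theory (Feldt-type) CI for alpha, assuming essentially parallel,
    normally distributed items: (1 - alpha) / (1 - alpha_hat) ~ F(n-1, (n-1)(k-1))."""
    n, k = x.shape
    a = alpha(x)
    df1, df2 = n - 1, (n - 1) * (k - 1)
    g = 1 - level
    return (1 - (1 - a) * f_dist.ppf(1 - g / 2, df1, df2),
            1 - (1 - a) * f_dist.ppf(g / 2, df1, df2))

# Empirical coverage under one (illustrative) parallel-items condition.
rng = np.random.default_rng(42)
n, k, loading = 100, 10, 0.5
pop_alpha = k * loading**2 / (k * loading**2 + (1 - loading**2))
hits = 0
for _ in range(2000):
    fscore = rng.normal(size=(n, 1))
    x = loading * fscore + np.sqrt(1 - loading**2) * rng.normal(size=(n, k))
    lo, hi = feldt_ci(x)
    hits += lo <= pop_alpha <= hi
print("empirical coverage:", hits / 2000)
```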

6.
Classical reliability theory assumes that individuals have identical true scores on both testing occasions, a condition described as stable. If some individuals' true scores are different on different testing occasions, described as unstable, the estimated reliability can be misleading. A model called stable unstable reliability theory (SURT) frames stability or instability as an empirically testable question. SURT assumes a mixed population of stable and unstable individuals in unknown proportions, with w(i) the probability that individual i is stable. w(i) becomes individual i's test score weight, which is used to form a weighted correlation coefficient r(w), the reliability under SURT. If all w(i) = 1 then r(w) is the classical reliability coefficient; thus classical theory is a special case of SURT. Typically r(w) is larger than the conventional reliability r, and confidence intervals on true scores are typically shorter than conventional intervals. r(w) is computed with routines in a publicly available R package.
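
The coefficient r(w) described above is, at its core, a weighted Pearson correlation between the two occasions' scores with person weights w(i). A minimal sketch of that computation is below, with the weights taken as given; the estimation of the w(i) themselves (and the R package mentioned in the abstract) is not reproduced, and the toy data and function name are illustrative.

```python
import numpy as np

def weighted_correlation(x, y, w):
    """Weighted Pearson correlation of scores x and y with person weights w
    (e.g., w_i = estimated probability that person i is 'stable' under SURT).
    With all w_i = 1 this reduces to the ordinary correlation coefficient."""
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, w))
    mx = np.average(x, weights=w)
    my = np.average(y, weights=w)
    cov = np.average((x - mx) * (y - my), weights=w)
    vx = np.average((x - mx) ** 2, weights=w)
    vy = np.average((y - my) ** 2, weights=w)
    return cov / np.sqrt(vx * vy)

# Toy check: with unit weights, the result matches np.corrcoef.
rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 0.7 * x + rng.normal(scale=0.7, size=50)
print(weighted_correlation(x, y, np.ones(50)), np.corrcoef(x, y)[0, 1])
```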

7.
We explore the justification and formulation of a four‐parameter item response theory model (4PM) and employ a Bayesian approach to successfully recover parameter estimates for items and respondents. For data generated using a 4PM item response model, overall fit is improved when using the 4PM rather than the 3PM or the 2PM. Furthermore, although estimated trait scores under the various models correlate almost perfectly, inferences at the high and low ends of the trait continuum are compromised, with poorer coverage of the confidence intervals when the wrong model is used. We also show in an empirical example that the 4PM can yield new insights into the properties of a widely used delinquency scale. We discuss the implications for building appropriate measurement models in education and psychology to model more accurately the underlying response process.
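
For reference, a small sketch of the standard four-parameter logistic item response function, with lower asymptote c and upper asymptote d; the parameter values are illustrative, and the Bayesian estimation used in the article is not reproduced.

```python
import numpy as np

def p_4pl(theta, a, b, c, d):
    """Four-parameter logistic IRF: lower asymptote c (guessing), upper
    asymptote d (slipping / carelessness), discrimination a, difficulty b.
    d = 1 gives the 3PL; c = 0 and d = 1 give the 2PL."""
    return c + (d - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 9)
print(p_4pl(theta, a=1.5, b=0.0, c=0.15, d=0.95))
```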

8.
This paper is a presentation of an essential part of the sampling theory of the error variance and the standard error of measurement. An experimental assumption is that several equivalent tests with equal variances are available. These may be either final forms of the same test or obtained by dividing one test into several parts. The simple model of independent and normally distributed errors of measurement with zero mean is employed. No assumption is made about the form of the distributions of true and observed scores. This implies unrestricted freedom in defining the population. First, maximum-likelihood estimators of the error variance and the standard error of measurement are obtained, their sampling distributions given, and their properties investigated. Then unbiased estimators are defined and their distributions derived. The accuracy of estimation is given special consideration from various points of view. Next, rigorous statistical tests are developed to test hypotheses about error variances on the basis of one and two samples. Also the construction of confidence intervals is treated. Finally, Bartlett's test of homogeneity of variances is used to provide a multi-sample test of equality of error variances.
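
Under the model described, independent, zero-mean, normally distributed errors across k equivalent forms, the error variance is naturally estimated from the pooled within-person variability. The sketch below shows the maximum-likelihood (divisor nk) and unbiased (divisor n(k − 1)) versions and the implied standard error of measurement; it follows the textbook construction rather than reproducing the paper's derivations, and the simulated data and function name are illustrative.

```python
import numpy as np

def error_variance_estimates(scores):
    """scores: n-examinees x k-parallel-forms matrix.
    Returns (ML estimate, unbiased estimate) of the error variance under the
    model X_ij = T_i + e_ij with independent normal errors of mean zero."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    within_ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum()
    return within_ss / (n * k), within_ss / (n * (k - 1))

rng = np.random.default_rng(7)
true_scores = rng.normal(50, 10, size=(200, 1))
forms = true_scores + rng.normal(0, 4, size=(200, 3))   # true sigma_e = 4
ml, unbiased = error_variance_estimates(forms)
print("SEM (ML):", np.sqrt(ml), "SEM (unbiased):", np.sqrt(unbiased))
```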

9.
Assuming item parameters on a test are known constants, the reliability coefficient for item response theory (IRT) ability estimates is defined for a population of examinees in two different ways: as (a) the product-moment correlation between ability estimates on two parallel forms of a test and (b) the squared correlation between the true abilities and estimates. Due to the bias of IRT ability estimates, the parallel-forms reliability coefficient is not generally equal to the squared-correlation reliability coefficient. It is shown algebraically that the parallel-forms reliability coefficient is expected to be greater than the squared-correlation reliability coefficient, but the difference would be negligible in a practical sense.

10.
In item response theory, the classical estimators of ability are highly sensitive to response disturbances and can return strongly biased estimates of the true underlying ability level. Robust methods were introduced to lessen the impact of such aberrant responses on the estimation process. The computation of asymptotic (i.e., large‐sample) standard errors (ASE) for these robust estimators, however, has not yet been fully considered. This paper focuses on a broad class of robust ability estimators, defined by an appropriate selection of the weight function and the residual measure, for which the ASE is derived from the theory of estimating equations. The maximum likelihood (ML) and the robust estimators, together with their estimated ASEs, are then compared in a simulation study by generating random guessing disturbances. It is concluded that both the estimators and their ASE perform similarly in the absence of random guessing, while the robust estimator and its estimated ASE are less biased and outperform their ML counterparts in the presence of random guessing with large impact on the item response process.

11.
The point-biserial correlation is a commonly used measure of effect size in two-group designs. New estimators of point-biserial correlation are derived from different forms of a standardized mean difference. Point-biserial correlations are defined for designs with either fixed or random group sample sizes and can accommodate unequal variances. Confidence intervals and standard errors for the point-biserial correlation estimators are derived from the sampling distributions for pooled-variance and separate-variance versions of a standardized mean difference. The proposed point-biserial confidence intervals can be used to conduct directional two-sided tests, equivalence tests, directional non-equivalence tests, and non-inferiority tests. A confidence interval for an average point-biserial correlation in meta-analysis applications performs substantially better than the currently used methods. Sample size formulas for estimating a point-biserial correlation with desired precision and testing a point-biserial correlation with desired power are proposed. R functions are provided that can be used to compute the proposed confidence intervals and sample size formulas.
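
By definition the point-biserial correlation is the Pearson correlation between a 0/1 group indicator and the score, which the sketch below computes directly, together with an ordinary Fisher-z interval as a rough baseline. The intervals proposed in the abstract, derived from the sampling distribution of a standardized mean difference, and the accompanying R functions are not reproduced; the function names and the simulated two-group data are illustrative.

```python
import numpy as np
from scipy.stats import norm

def point_biserial(group, y):
    """Pearson correlation between a 0/1 group indicator and a numeric score."""
    return np.corrcoef(np.asarray(group, float), np.asarray(y, float))[0, 1]

def fisher_z_ci(r, n, level=0.95):
    """Rough baseline CI via Fisher's z transformation (it assumes bivariate
    normality, which a binary indicator violates; shown only for comparison)."""
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(n - 3)
    crit = norm.ppf(1 - (1 - level) / 2)
    return np.tanh(z - crit * se), np.tanh(z + crit * se)

rng = np.random.default_rng(11)
g = np.repeat([0, 1], 60)
y = 0.5 * g + rng.normal(size=120)          # true standardized difference 0.5
r = point_biserial(g, y)
print(r, fisher_z_ci(r, len(y)))
```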

12.
Effects of the testing situation on item responding: cause for concern
The effects of faking on personality test scores have been studied previously by comparing (a) experimental groups instructed to fake or answer honestly, (b) subgroups created from a single sample of applicants or nonapplicants by using impression management scores, and (c) job applicants and nonapplicants. In this investigation, the latter 2 methods were used to study the effects of faking on the functioning of the items and scales of the Sixteen Personality Factor Questionnaire. A variety of item response theory methods were used to detect differential item/test functioning, interpreted as evidence of faking. The presence of differential item/test functioning across testing situations suggests that faking adversely affects the construct validity of personality scales and that it is problematic to study faking by comparing groups defined by impression management scores.

13.
This article presents a generalization of the Score method of constructing confidence intervals for the population proportion (E. B. Wilson, 1927) to the case of the population mean of a rating scale item. A simulation study was conducted to assess the properties of the Score confidence interval in relation to the traditional Wald (A. Wald, 1943) confidence interval under a variety of conditions, including sample size, number of response options, extremeness of the population mean, and kurtosis of the response distribution. The results of the simulation study indicated that the Score interval usually outperformed the Wald interval, suggesting that the Score interval is a viable method of constructing confidence intervals for the population mean of a rating scale item.
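
Only the proportion case is sketched here: the classic Wilson (1927) Score interval next to the Wald interval. The article's generalization to the mean of a rating-scale item is not reproduced, and the example counts are illustrative.

```python
import numpy as np
from scipy.stats import norm

def wald_ci(successes, n, level=0.95):
    """Wald interval for a binomial proportion (can degenerate near 0 or 1)."""
    p = successes / n
    z = norm.ppf(1 - (1 - level) / 2)
    half = z * np.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_score_ci(successes, n, level=0.95):
    """Wilson (1927) Score interval: invert the Score test for a binomial p."""
    p = successes / n
    z = norm.ppf(1 - (1 - level) / 2)
    centre = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Near the boundary the Wald interval spills below zero; the Score interval does not.
print(wald_ci(2, 25), wilson_score_ci(2, 25))
```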

14.
The published norms for the Depression Anxiety and Stress Scale (DASS) give results for clinical populations that largely fall in the severe to very severe categories. As a result, within this population, the ability to compare the contributions of the underlying emotional components is reduced. The present study presents results from a large general psychiatric outpatient population and provides percentile norms with confidence intervals for both the original DASS and the shorter 21‐item form. It is noted that both forms have high validity but that the correlations between scales are higher than those reported in non‐clinical populations. There was little variation between sexes but some variation of results with age, with both younger and older cohorts having lower scores, except for the Stress scale, where the older group scored higher. There is some evidence of a ceiling effect in the Depression and Stress scales. It was noted that nearly a quarter of patient scores fell within the originally defined normal range, suggesting that the DASS would not be a particularly sensitive instrument in its previously reported use as a screening instrument for psychiatric illness.

15.
Confidence intervals (CIs) are fundamental inferential devices which quantify the sampling variability of parameter estimates. In item response theory, CIs have been primarily obtained from large-sample Wald-type approaches based on standard error estimates, derived from the observed or expected information matrix, after parameters have been estimated via maximum likelihood. An alternative approach to constructing CIs is to quantify sampling variability directly from the likelihood function with a technique known as profile-likelihood confidence intervals (PL CIs). In this article, we introduce PL CIs for item response theory models, compare PL CIs to classical large-sample Wald-type CIs, and demonstrate important distinctions among these CIs. CIs are then constructed for parameters directly estimated in the specified model and for transformed parameters which are often obtained post-estimation. Monte Carlo simulation results suggest that PL CIs perform consistently better than Wald-type CIs for both non-transformed and transformed parameters.
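
The generic recipe behind a profile-likelihood CI is to keep every parameter value whose log-likelihood lies within half the .95 quantile of a chi-square distribution with one degree of freedom of the maximum. A minimal sketch for the ability parameter of a Rasch model with item difficulties treated as known is below; the model, the response pattern, and the search bounds are illustrative simplifications, not the article's implementation.

```python
import numpy as np
from scipy.optimize import brentq, minimize_scalar
from scipy.stats import chi2

def rasch_loglik(theta, responses, b):
    """Log-likelihood of a response pattern under the Rasch model with known
    item difficulties b (an illustrative simplification)."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

def profile_likelihood_ci(responses, b, level=0.95):
    # Maximize the log-likelihood, then find where it drops by chi2(1, level)/2.
    res = minimize_scalar(lambda t: -rasch_loglik(t, responses, b),
                          bounds=(-6, 6), method="bounded")
    theta_hat, l_max = res.x, -res.fun
    cut = l_max - chi2.ppf(level, df=1) / 2.0
    g = lambda t: rasch_loglik(t, responses, b) - cut
    lower = brentq(g, -8, theta_hat)
    upper = brentq(g, theta_hat, 8)
    return theta_hat, (lower, upper)

b = np.linspace(-2, 2, 15)
y = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0])
print(profile_likelihood_ci(y, b))
```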

16.
Cronbach's alpha has a number of drawbacks as an index of reliability. To overcome its shortcomings, researchers have proposed several alternative reliability estimates, but popular statistical software does not provide them directly, so they have not been widely adopted in practice. To narrow this gap between theory and practice, this article uses concrete examples to present Mplus programs for several commonly used reliability estimates: composite reliability, single-indicator reliability, and ωh.
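
As a plain illustration of one of the estimates named above, the sketch below computes composite reliability (McDonald's omega) for a unit-weighted sum score from the loadings and residual variances of a one-factor model with uncorrelated residuals. It is a numeric illustration in Python, not the Mplus syntax provided in the article, and the loading values are made up.

```python
import numpy as np

def composite_reliability(loadings, residual_vars):
    """Composite reliability (McDonald's omega) of a unit-weighted sum score
    from a one-factor model with uncorrelated residuals:
    omega = (sum lambda)^2 / ((sum lambda)^2 + sum theta)."""
    lam = np.asarray(loadings, dtype=float)
    theta = np.asarray(residual_vars, dtype=float)
    return lam.sum() ** 2 / (lam.sum() ** 2 + theta.sum())

# Illustrative standardized loadings and residual variances, e.g. as estimated
# by Mplus or any other SEM program.
print(composite_reliability([0.7, 0.6, 0.8, 0.5], [0.51, 0.64, 0.36, 0.75]))
```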

17.
Liu, Yang; Hannig, Jan; Pal Majumder, Abhishek. Psychometrika, 2019, 84(3): 701-718
In applications of item response theory (IRT), it is often of interest to compute confidence intervals (CIs) for person parameters with prescribed frequentist coverage. The...

18.
A rationale for, and data from, a trial of a theory of item generation by algorithms whose origins are cognitive models of task performance are presented. Since Spearman (1904), intelligence has been operationally defined and assessed in human subjects by administering identical test items whose content and order have been fixed only after empirical iterations. In our approach, intelligence is ostensively defined by theoretically determined algorithms used for item construction and presentation. Knowledge of what cognitive factors limit human performance makes it possible to vary within tightly specified parameters those features of the tasks that contribute to difficulty, which we call radicals, to let those components of the tasks that do not contribute to difficulty vary randomly, and to counterbalance aspects of answer production that might induce biases of response. Empirical data are based on the generation of five different short tests demanding only functional literacy as a prerequisite for their execution. Four parallel forms of each test were administered to young male Army recruits whose scores were collated with their Army Entrance Test results, which were not previously known to us. Results show that the parallel, algorithm-generated item sets are statistically invariant, which item generation theory demands; and that the individual tests differentially predict Army Entrance Test scores. We conclude that IQ test performances are parsimoniously explained by individual differences in encoding, comparison and reconstructive memory processes.

19.
This study in parametric test theory deals with the statistics of reliability estimation when scores on two parts of a test follow a binormal distribution with equal (Case 1) or unequal (Case 2) expectations. In each case biased maximum-likelihood estimators of reliability are obtained and converted into unbiased estimators. Sampling distributions are derived. Second moments are obtained and utilized in calculating mean square errors of estimation as a measure of accuracy. A rank order of four estimators is established. There is a uniformly best estimator. Tables of absolute and relative accuracies are provided for various reliability parameters and sample sizes.

20.
Under classical test theory, the traditional ways of setting cut scores for criterion-referenced tests are grade-based scoring or directly specifying a cut score, and methods for setting cut scores need further development. The Bookmark method is a cut-score-setting procedure based on item response theory: working from the ability parameter values of the test material, subject-matter experts set multiple cut scores according to the quantitative relationship between the mastery percentage score and examinees' ability levels, which is more efficient and precise than the traditional methods. The authors review the basic principles and implementation procedures of the Bookmark method, analyze its application prospects, and review research on the reliability, validity, and standard-error estimation of cut scores set with the Bookmark method.
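
The ordering step of the Bookmark method places each item at the ability level where an examinee answers it correctly with a chosen response probability (RP, commonly .67). A small sketch of that mapping for 2PL items is below; the 2PL parameterization, the RP value, and the example item parameters are illustrative conventions rather than details taken from this review.

```python
import numpy as np

def bookmark_location(a, b, rp=0.67):
    """Ability level at which a 2PL item is answered correctly with probability
    rp (the 'response probability' used to order items in a Bookmark-style
    ordered item booklet): theta = b + ln(rp / (1 - rp)) / a."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return b + np.log(rp / (1.0 - rp)) / a

# Order an illustrative item pool by its RP67 locations.
a = np.array([0.8, 1.2, 1.0, 1.5])
b = np.array([-0.5, 0.3, 1.1, -1.2])
loc = bookmark_location(a, b)
print(np.argsort(loc), np.sort(loc))
```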
