An Assessment of The Nonparametric Approach for Evaluating The Fit of Item Response Models

As item response theory (IRT) has matured and become widely applied, investigating the fit of a parametric model has become an important part of the measurement process. The usefulness and success of IRT applications depend heavily on how well the model reflects the data, so model-data fit must be evaluated with sufficient evidence before any model is applied. Promising solutions for detecting model misfit in IRT are lacking. In addition, commonly used fit statistics are unsatisfactory: they often lack desirable statistical properties and offer no means of examining the magnitude of misfit (e.g., via graphical inspection). In this dissertation, a recently proposed nonparametric approach, RISE, was studied comprehensively. Specifically, the purposes of this study are to (a) examine the promising fit procedure, RISE, (b) compare the statistical properties of RISE with those of commonly used goodness-of-fit procedures, and (c) investigate how RISE may be used to examine the consequences of model misfit. To reach these goals, both a simulation study and an empirical study were conducted. In the simulation study, four factors that may influence the performance of a fit statistic were varied: ability distribution, sample size, test length, and model. The results demonstrated that RISE outperformed G² and S-X² in that it controlled Type I error rates and provided adequate power under all conditions. In the empirical study, the three fit statistics were applied to an empirical data set and the misfitting items were flagged. RISE and S-X² detected reasonable numbers of misfitting items, while G² flagged almost all items when the sample size was large. To further demonstrate an advantage of RISE, a residual plot was shown for each misfitting item. Compared to G² and S-X², RISE gave a much clearer picture of the location and magnitude of misfit for each misfitting item. Beyond statistical properties and graphical displays, the score distribution and test characteristic curve (TCC) were investigated as consequences of model misfit. The results indicated that, for the given data, there was no practical difference in classification before and after replacing the misfitting items detected by the three fit statistics.
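
As a rough illustration of the test characteristic curve mentioned above: the TCC gives the expected number-correct score at a given ability level, that is, the sum of the item response functions across items. The sketch below assumes a two-parameter logistic (2PL) model and made-up item parameters. The model choice, the function names, and the parameter values are all illustrative assumptions and are not taken from the dissertation.

```python
import math

def irf_2pl(theta, a, b):
    """Probability of a correct response under an assumed 2PL model.

    theta: examinee ability; a: item discrimination; b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta.

    items: list of (a, b) pairs, one per item (hypothetical values here).
    """
    return sum(irf_2pl(theta, a, b) for a, b in items)

# Hypothetical three-item test, for illustration only.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]

# Evaluate the TCC on a small grid of ability values.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta={theta:+.1f}  expected score={tcc(theta, items):.3f}")
```

Comparing two such curves, one computed with the original item set and one after misfitting items are replaced, is one simple way to see whether expected scores shift near a classification cut point.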