Application of Item Response Theory Models to the Algorithmic Detection of Shift Errors on Paper and Pencil Tests
Abstract
On paper-and-pencil multiple-choice tests, the potential for examinees to mark their answers in incorrect locations presents a serious threat to the validity of test score interpretations. When an examinee skips one or more items (i.e., answers out of sequence) but fails to reflect the size of that skip on the answer sheet, the result can be a string of misaligned responses called a shift error. Shift errors can cause correct answers to be marked as incorrect, leading to possible underestimation of an examinee's true ability. Despite the movement toward computerized testing in recent years, paper-and-pencil multiple-choice tests remain pervasive in many high-stakes assessment settings, including K-12 testing (e.g., MCAS) and college entrance exams (e.g., SAT), so there is a continuing need to address issues that arise within this format. Techniques for detecting aberrant response patterns are well established, but they do little to identify the reasons for the aberrance, limiting the options for addressing misfitting patterns. Although some work has been done to detect and address specific forms of aberrant response behavior, little has been done in the area of shift error detection, leaving substantial room for improvement in addressing this source of aberrance. The ability to accurately detect such construct-irrelevant errors, and either adjust scores to more accurately reflect examinee ability or flag examinees with inaccurate scores for removal from the data set and retesting, would improve the validity of important decisions based on test scores and could improve model fit by allowing more accurate item parameter and ability estimation.

The purpose of this study is to investigate new algorithms for shift error detection that employ IRT models to determine, probabilistically, whether misfitting patterns are likely to be shift errors. The study examines a matrix of detection algorithms, probabilistic models, and person parameter methods, testing combinations of these factors for their selectivity (i.e., true positives vs. false positives), sensitivity (i.e., true shift errors detected vs. undetected), and robustness to parameter bias, all under a carefully manipulated, multifaceted simulation environment.

This investigation attempts to answer the following questions, applicable across detection methods, bias reduction procedures, shift conditions, and ability levels, but stated generally as: 1) How sensitively and selectively can an IRT-based probabilistic model detect shift errors across the full range of probabilities under specific conditions? 2) How robust is each detection method to the parameter bias introduced by shift errors? 3) How well does the detection method detect shift errors compared to other, more general indices of person fit? 4) What is the impact on bias of making the proposed corrections to detected shift errors? 5) To what extent does shift error, as detected by the method, occur within an empirical data set?

Results show that the proposed methods can detect shift errors at reasonably high rates with only a minimal number of false positives, that detection improves for longer shift errors, and that examinee ability is a major determinant of the effectiveness of the shift error detection techniques. Although some detection power is lost to person parameter bias, the loss is minimal for all but the shortest shift errors. Application to empirical data also proved effective, though some discrepancies in projected total counts suggest that refinements to the technique are required. Use of a person fit statistic to detect examinees with shift errors was shown to be completely ineffective, underscoring the value of shift-error-specific detection methods.
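Although the abstract does not detail the detection algorithm itself, the core idea it describes, using an IRT model to judge whether a candidate realignment of a response string is more probable than the string as recorded, can be sketched as follows. This is a minimal illustrative sketch, not the dissertation's actual procedure: the choice of the 3PL model, the shift_evidence function and its arguments, and the assumption that a skipped block of `size` items causes subsequent marks to answer items `size` positions later are all assumptions made for the example.

```python
import numpy as np

def p_correct(theta, a, b, c):
    """3PL probability of a correct response (typical for multiple-choice items)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def log_likelihood(scored, theta, a, b, c):
    """Log-likelihood of a 0/1 scored response vector at ability theta."""
    p = np.clip(p_correct(theta, a, b, c), 1e-9, 1.0 - 1e-9)
    return float(np.sum(scored * np.log(p) + (1 - scored) * np.log(1.0 - p)))

def shift_evidence(marks, key, theta, a, b, c, start, size):
    """
    Log-likelihood ratio favoring the hypothesis that a shift error of
    `size` positions began at item `start`.  marks[i] is the option bubbled
    in position i and key[i] is the keyed answer for item i.  Under the
    shift hypothesis (an illustrative assumption), the mark in position i
    (i >= start) was intended for item i + size, so it is rescored against
    that item's key and evaluated with that item's parameters.
    """
    n = len(key)
    idx = np.arange(start, n - size)                      # positions with a realignment target
    scored_as_is = (marks[idx] == key[idx]).astype(float)
    scored_shifted = (marks[idx] == key[idx + size]).astype(float)
    ll_as_is = log_likelihood(scored_as_is, theta, a[idx], b[idx], c[idx])
    ll_shifted = log_likelihood(scored_shifted, theta, a[idx + size], b[idx + size], c[idx + size])
    return ll_shifted - ll_as_is                          # > 0 favors the shift hypothesis
```

In practice, a detection procedure along these lines would scan candidate start positions and shift sizes for each examinee and flag cases where the evidence exceeds a threshold chosen to balance sensitivity (shift errors caught) against selectivity (false positives avoided); the ability estimate theta might also be re-estimated from the unaffected portion of the test to reduce the parameter bias that the shift itself introduces.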
Type
dissertation
Date
2013-09