Open Access Dissertation
Doctor of Philosophy (PhD)
Education (also CAGS)
Ronald K. Hambleton
Educational Assessment, Evaluation, and Research
Previous studies focused on reporting the subscores themselves. However, most proficiency tests are criterion-referenced, so reporting the classification consistency and accuracy of student placement into performance categories within each subdomain is more suitable. The primary purpose of this study was therefore to investigate the decision consistency (DC) and decision accuracy (DA) of student placement into performance categories within each of the subdomains measured by the test of interest. A second purpose was to compare the performance of five subscoring methods under realistic conditions in terms of subscore reliability and classification. To do so, a simulation study was designed, and factors related to DC and DA (number of subtests, subtest length, subtest inter-correlations, and location of cut scores) were investigated.
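The DC and DA indices described above can be sketched in a small simulation. The following is a minimal illustration, not the study's actual design: item parameters are fixed at the reported mean values (discrimination 1.5, difficulty 0, guessing .15), the sample size and the cut score of 12 are assumed, and DC is estimated by classifying examinees on two independently simulated parallel administrations while DA compares the observed classification to one based on each examinee's expected true score.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_3pl(theta, a, b, c, rng):
    """Dichotomous responses under a 3PL IRT model (scaling constant D = 1.7)."""
    p = c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta[:, None] - b)))
    return (rng.random(p.shape) < p).astype(int)

# Item qualities roughly matching the study's description (assumed fixed values):
n_examinees, n_items = 5000, 20
theta = rng.normal(size=n_examinees)
a = np.full(n_items, 1.5)    # mean discrimination about 1.5
b = np.zeros(n_items)        # mean difficulty about 0
c = np.full(n_items, 0.15)   # mean guessing about .15

# Two parallel simulated administrations of the same 20-item subtest:
form1 = simulate_3pl(theta, a, b, c, rng).sum(axis=1)
form2 = simulate_3pl(theta, a, b, c, rng).sum(axis=1)

cut = 12  # hypothetical cut score near the center of the score distribution

# Decision consistency: same classification on both parallel forms.
dc = np.mean((form1 >= cut) == (form2 >= cut))

# Decision accuracy: agreement between the observed classification and the
# classification based on each examinee's expected (true) score.
true_score = (c + (1 - c) / (1 + np.exp(-1.7 * a * (theta[:, None] - b)))).sum(axis=1)
da = np.mean((form1 >= cut) == (true_score >= cut))
```

With 20 items of this quality, both `dc` and `da` land near the .80-or-higher region the results below describe for longer subtests.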
Results showed that subscore reliability and classification estimates were a function of the number of subtests, subtest length, subtest inter-correlations, location of cut scores, and scoring method. Specifically, with respect to subscore reliability, results indicated that with item qualities similar to those used in this study (mean item discrimination about 1.5, mean item difficulty about 0, and mean guessing about .15), when a subtest had 20 items or more, the reliability estimates of raw and UIRT subscores were quite reasonable (in the range of .80 to .90); it therefore appears there is no need to augment those subscores. In contrast, the reliability estimates of raw and UIRT subscores for the 5- and 10-item subtests were barely acceptable (in the range of .60 to .70). Even after augmentation, the subscore reliability estimates became reasonable (around .80 or above) only when a subtest had 10 items and the subtest inter-correlations were .80 or .90.
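The dependence of reliability on subtest length follows the familiar Spearman-Brown pattern. As a rough illustration (the .65 starting value is an assumed figure inside the reported .60-.70 band, not a number from the study's tables), quadrupling a 5-item subtest to 20 items projects its reliability into the reported .80-.90 range:

```python
def spearman_brown(rho, factor):
    """Projected reliability when a test is lengthened by `factor` times."""
    return factor * rho / (1.0 + (factor - 1.0) * rho)

# Assumed 5-item subscore reliability of .65, lengthened 4x to 20 items:
projected = spearman_brown(0.65, 4)  # about .88, inside the .80-.90 band
```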
Results for subscore classification indicated that, with item qualities similar to those used in this study, when a subtest had approximately 20 items or more, applying augmentation made little difference because DCs and DAs were already approximately .80 or higher for the non-augmented subscoring methods; augmentation thus appears unnecessary there. Moreover, with one cut score at the center of the test score distribution, classification consistency and accuracy estimates were in the .70s for the 5-item subtests, for both non-augmented and augmented subscores. When the cut score was farther from the center of the test score distribution, classification consistency and accuracy for the 5-item subtests became higher. This means that even though 5-item subtest scores may not be reliable by themselves, classifications based on them can still be accurate, depending on where the cut score is located. With two cut scores, the DCs and DAs for the 10-item subtests may also be high enough, depending on the location of the cut scores.
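The effect of cut-score location can be seen even in a stylized single-examinee binomial true-score sketch (an illustration of the principle, not the study's method): for an examinee answering each of 5 items correctly with probability .5, a cut at the middle of the score range makes classification a coin flip, while a cut near the tail is almost always reproduced across parallel forms.

```python
from math import comb

def p_at_or_above(n, p, cut):
    """P(number-correct >= cut) under a binomial(n, p) true-score model."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(cut, n + 1))

def two_form_consistency(n, p, cut):
    """P(same pass/fail classification on two independent parallel forms)."""
    q = p_at_or_above(n, p, cut)
    return q * q + (1 - q) * (1 - q)

center = two_form_consistency(5, 0.5, 3)  # cut at the middle  -> 0.50
tail = two_form_consistency(5, 0.5, 1)    # cut near the tail  -> about 0.94
```

Examinees far from the cut are classified consistently regardless of measurement error, which is why moving the cut away from the bulk of the score distribution raises DC and DA even for short subtests.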
Fan, Fen, "Subscore Reliability and Classification Consistency: A Comparison of Five Methods" (2016). Doctoral Dissertations. 857.