JITP 2011: The Future of Computational Social Science
 

Publication Date

2011

Abstract

Text is becoming a central source of data for social science research. With advances in digitization and open records practices, the central challenge has in large part shifted from availability to usability. Automated text classification methodologies are becoming increasingly important within political science because they hold the promise of substantially reducing the costs of converting text to data for a variety of tasks. In this paper, we consider a number of questions of interest to prospective users of supervised learning methods, which are appropriate to classification tasks where known categories are applied. For the right task, supervised learning methods can dramatically lower the costs associated with labeling large volumes of textual data while maintaining high reliability and accuracy. Information science researchers devote considerable attention to comparing the performance of supervised learning algorithms and different feature representations, but the questions posed are often less directly relevant to the practical concerns of social science researchers. The first question prospective social science users are likely to ask is: how well do such methods work? The second is likely to be: how much do they cost in terms of human labeling effort? Relatedly, how much do marginal improvements in performance cost? We address these questions in the context of a particular dataset, the Congressional Bills Project, which includes more than 400,000 labeled bill titles across 19 policy topics. This corpus also provides opportunities to experiment with varying sample sizes and sampling methodologies. We are ultimately able to locate an accuracy/efficiency sweet spot of sorts for this dataset by leveraging results generated by an ensemble of supervised learning algorithms.