Off-campus UMass Amherst users: To download dissertations, please use the following link to log into our proxy server with your UMass Amherst user name and password.

Non-UMass Amherst users, please click the view more button below to purchase a copy of this dissertation from Proquest.

(Some titles may also be available free of charge in our Open Access Dissertation Collection, so please check there first.)

Information extraction as a basis for portable text classification systems

Ellen Michele Riloff, University of Massachusetts Amherst


Knowledge-based natural language processing systems have achieved good success with many tasks, but they often require many person-months of effort to build an appropriate knowledge base. As a result, they are not portable across domains. This knowledge-engineering bottleneck must be addressed before knowledge-based systems will be practical for real-world applications. This dissertation addresses the knowledge-engineering bottleneck for a natural language processing task called "information extraction". A system called AutoSlog is presented which automatically constructs dictionaries for information extraction, given an appropriate training corpus. In the domain of terrorism, AutoSlog created a dictionary using a training corpus and five person-hours of effort that achieved 98% of the performance of a hand-crafted dictionary that took approximately 1500 person-hours to build. This dissertation also describes three algorithms that use information extraction to support high-precision text classification. As more information becomes available on-line, intelligent information retrieval will be crucial in order to navigate the information highway efficiently and effectively. The approach presented here represents a compromise between keyword-based techniques and in-depth natural language processing. The text classification algorithms classify texts with high accuracy by using an underlying information extraction system to represent linguistic phrases and contexts. Experiments in the terrorism domain suggest that increasing the amount of linguistic context can improve performance. Both AutoSlog and the text classification algorithms are evaluated in three domains: terrorism, joint ventures, and microelectronics. An important aspect of this dissertation is that AutoSlog and the text classification systems can be easily ported across domains.

Subject Area

Computer science|Artificial intelligence

Recommended Citation

Riloff, Ellen Michele, "Information extraction as a basis for portable text classification systems" (1994). Doctoral Dissertations Available from Proquest. AAI9510533.