An investigation of the linguistic characteristics of Japanese information retrieval

Hideo Fujii, University of Massachusetts Amherst


This dissertation examines and demonstrates the effective use of linguistic knowledge in information retrieval (IR) technology. Linguistic IR research has a long history of serious but often unsuccessful endeavors; our retrieval experiments, however, generally confirmed a significant performance improvement from these linguistic techniques. The experiments were conducted on a Japanese corpus. Thus, this research also serves as a case study of "linguistic information retrieval" for Japanese, as opposed to English, which has traditionally been the predominant language of study. The methodology adopted in this study is called the grammatical paraphrasing paradigm for query formulation, which translates a formal grammatical relationship into a retrieval strategy. To realize this paradigm, we drew on the theory of generative grammar to develop a class of query strategies that are applied to a sentence in a base query and that involve various valency structures, such as transitivity or intransitivity in the lexicon, or causativization or passivization in the syntax. We call this class of strategies valency control strategies. The most distinctive advantage of this method is its capability to draw two contingent sets of dichotomous views. The first view is the valency dichotomy, which reveals the difference in strategic gain between the monovalent (i.e., intransitive and passive) and bivalent (i.e., transitive and causative) strategies. The second view is the dichotomy within a system of linguistic components, where lexical and syntactical modules have separate retrieval mechanisms. After developing the general framework of valency control strategies from a linguistic background, especially involving the phenomenon of transitivity alternations, which are pervasive in Japanese, we examined its effectiveness in a series of experiments. We found the following three important results.
First, the overall result showed that most valency control query strategies considerably improved precision. This means that linguistic knowledge is a highly valuable knowledge source in information retrieval. Second, in the valency dichotomy, the bivalent strategy improved performance, but the monovalent strategy degraded it. This result indicates the usefulness of formally definable grammatical strategies in information retrieval. Third, in the linguistic module dichotomy, despite the conventional wisdom that emphasizes local morpho-lexical information, the syntactical method was effective, as was the lexical method. Two additional experiments, on potentialization and on verbal nouns, were also carried out. The potential query strategy on verbs, which does not change the valency, showed a moderate performance improvement, falling between that of the bivalent and monovalent strategies. The performance of verbal noun strategies was not as encouraging as that of verb strategies. The genitive verbal noun strategy showed a particularly clear degradation, which probably reflects earlier findings in the literature that phrase recognition achieved only limited retrieval improvement. Finally, this research also has a strong practical implication. We ran two sets of experiments: one using the relevance feedback method, the other using the automatic query generation method. Our results showed that the automatic method works roughly as well as the relevance feedback method. This suggests that our method has significant practical applications, because it does not rely on relevance information to improve query performance.

Subject Area

Computer science

Recommended Citation

Fujii, Hideo, "An investigation of the linguistic characteristics of Japanese information retrieval" (1998). Doctoral Dissertations Available from Proquest. AAI9823737.