Cluster-based retrieval from a language modeling perspective

Xiaoyong Liu, University of Massachusetts Amherst

Abstract

The standard approach to document retrieval assumes that the relevance of each document can be assessed independently: the fact that one document is relevant does not help predict the relevance of a closely related document. Cluster-based retrieval, on the other hand, assumes that the probability of relevance of a document should depend on the relevance of other similar documents to the same query. The goal is to find the best group of documents. The most common approach to cluster-based retrieval, which was proposed in the 1970s, is to retrieve one or more clusters in their entirety in response to a query. Research in this area has suggested that "optimal" clusters exist that, if retrieved, would yield very large improvements in effectiveness relative to document retrieval. However, no real retrieval strategy has achieved this result. Except for precision-oriented searches on very small data sets, document retrieval has been found to be generally more effective. There has been a resurgence of research in cluster-based retrieval in the past few years, including our own efforts in this area. The general approach is to use clusters as a form of document smoothing. Studies have shown that clusters can indeed improve retrieval performance automatically on modern test collections, and that the language modeling framework is an effective probabilistic retrieval framework for studying this type of problem. This thesis revisits the problem of retrieving the best group of documents from the language-modeling perspective. We study both cluster smoothing and cluster retrieval. We analyze the advantages and disadvantages of a range of representation techniques, derive features that characterize good document clusters, and develop new probabilistic representations that capture the identified features. An extensive empirical evaluation is provided for the various techniques proposed in this work.
We find that whether good document clusters can be successfully identified or utilized by an IR system depends largely on how they are represented. Both the CBDM model for cluster smoothing and the geometric mean representation for cluster retrieval are shown to be effective approaches for cluster-based retrieval.
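
The idea of using clusters as a form of document smoothing can be illustrated with a small sketch. The abstract does not give a formula, so the code below is only a generic interpolation of document, cluster, and collection language models under assumed weights. It is not the CBDM model as defined in the dissertation. All function names, the parameters `lam` and `beta`, and the toy documents are hypothetical, and the sketch assumes every query term occurs somewhere in the collection.

```python
import math
from collections import Counter


def ml_model(tokens):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def smoothed_prob(term, doc, cluster, collection, lam=0.5, beta=0.5):
    """Interpolate the document model with a cluster/collection background.

    lam and beta are illustrative weights, not values from the dissertation.
    """
    background = beta * cluster.get(term, 0.0) + (1 - beta) * collection.get(term, 0.0)
    return lam * doc.get(term, 0.0) + (1 - lam) * background


def query_log_likelihood(query, doc, cluster, collection, lam=0.5, beta=0.5):
    """Sum of log smoothed term probabilities over the query terms.

    Assumes each query term appears in the collection, so no probability is zero.
    """
    return sum(
        math.log(smoothed_prob(term, doc, cluster, collection, lam, beta))
        for term in query
    )


# Toy data: d1 and d2 form one cluster, d3 is a cluster of its own.
d1 = "cluster retrieval language model".split()
d2 = "cluster smoothing language model".split()
d3 = "cooking pasta recipe sauce".split()

collection_model = ml_model(d1 + d2 + d3)
cluster_a = ml_model(d1 + d2)
cluster_b = ml_model(d3)

query = ["smoothing"]
score_d1 = query_log_likelihood(query, ml_model(d1), cluster_a, collection_model)
score_d2 = query_log_likelihood(query, ml_model(d2), cluster_a, collection_model)
score_d3 = query_log_likelihood(query, ml_model(d3), cluster_b, collection_model)
```

In this toy setup neither d1 nor d3 contains the query term, but d1 shares a cluster with a document that does, so its smoothed score is higher than that of d3. That is the intuition behind letting a document's estimated relevance depend on similar documents.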

Subject Area

Computer science

Recommended Citation

Liu, Xiaoyong, "Cluster-based retrieval from a language modeling perspective" (2008). Doctoral Dissertations Available from Proquest. AAI3315531.
https://scholarworks.umass.edu/dissertations/AAI3315531
