Date of Award


Access Type

Campus Access

Document type


Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

First Advisor

W. Bruce Croft

Second Advisor

James Allan

Third Advisor

David A. Smith

Subject Categories

Computer Sciences


Query reformulation modifies the original query with the aim of better representing a user's information need and consequently improving retrieval performance. Previous reformulation models typically generate words and phrases related to the original query, but do not consider how these words and phrases would fit together in realistic or actual queries. Some recent work on web search studies specific reformulation operations, but does not address how to combine different operations within the same framework. Furthermore, little research considers the reformulation model and the retrieval model from a joint perspective.

In this dissertation, a novel framework is proposed that models reformulation as a distribution over reformulated queries, where each reformulated query is associated with a probability indicating its importance. On one hand, this framework treats a reformulated query as the basic unit and can therefore capture the important query-level dependencies between words and phrases in a realistic or actual query. On the other hand, since a reformulated query is the output of applying one or more reformulation operations, the framework combines different operations, such as query segmentation, query substitution and query deletion, in a single model. Moreover, a retrieval model is an integral part of the framework, so the reformulation model and the retrieval model are considered jointly.

Specifically, the query distribution framework consists of three major components: query generation, probability estimation and retrieval. For query generation, we generate reformulated queries that are semantically related to the original query using different operations. For probability estimation, we estimate the probability assigned to each reformulated query by directly optimizing retrieval performance. For retrieval, the retrieval scores from the reformulated queries are combined, with the probabilities used as the combination weights.
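The retrieval step described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the toy term-count scorer, the example queries and the probabilities are all hypothetical, and the sketch further assumes that a document missing from one query's results contributes a score of zero for that query.

```python
from collections import defaultdict


def toy_retrieve(query, docs):
    """Hypothetical scorer: counts query-term occurrences in each document."""
    terms = query.lower().split()
    scores = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score:
            scores[doc_id] = float(score)
    return scores


def combine_scores(query_distribution, retrieve):
    """Combine per-query retrieval scores, weighted by query probability.

    query_distribution: list of (reformulated_query, probability) pairs.
    retrieve: callable mapping a query string to {doc_id: score}.
    Returns (doc_id, combined_score) pairs, best first.
    """
    total = sum(prob for _, prob in query_distribution)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    combined = defaultdict(float)
    for query, prob in query_distribution:
        weight = prob / total  # normalise so the weights sum to 1
        for doc_id, score in retrieve(query).items():
            combined[doc_id] += weight * score
    # Sort by descending score, breaking ties by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Illustrative data only; the probabilities are made up, not estimated.
docs = {
    "d1": "cheap flights boston",
    "d2": "boston hotels",
    "d3": "flights to new york",
}
distribution = [
    ("cheap flights boston", 0.5),
    ("flights boston", 0.3),
    ("boston", 0.2),
]
ranking = combine_scores(distribution, lambda q: toy_retrieve(q, docs))
```

In practice the scorer would be a full retrieval model and the probabilities would come from the estimation step; the sketch only shows how the weighted combination fits together.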

Furthermore, in order to model the relationships between the reformulated queries, we extend the standard query distribution model to a hierarchical query distribution model. The hierarchical query distribution model transforms the original query into a reformulation tree, where each path of the tree models a sequence of generating reformulated queries. A stage-based probability estimation approach is proposed to capture the relationships between queries and directly optimize retrieval performance.
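One way to picture a reformulation tree is the sketch below. It is an assumption-laden illustration rather than the dissertation's model: the ReformulationNode class, the per-node weight field, the example queries and the rule that a node's probability is the product of the weights along its path are all hypothetical choices made here for clarity.

```python
from dataclasses import dataclass, field


@dataclass
class ReformulationNode:
    query: str
    weight: float = 1.0  # hypothetical weight relative to the parent node
    children: list = field(default_factory=list)


def tree_paths(node, prefix=()):
    """Yield each root-to-leaf path as a tuple of query strings."""
    current = prefix + (node.query,)
    if not node.children:
        yield current
    for child in node.children:
        yield from tree_paths(child, current)


def flatten(node, parent_prob=1.0):
    """List (query, probability) for every node in the tree.

    Assumption: a node's probability is the product of the weights
    on the path from the root down to that node.
    """
    prob = parent_prob * node.weight
    result = [(node.query, prob)]
    for child in node.children:
        result.extend(flatten(child, prob))
    return result


# Illustrative tree; the queries and weights are made up.
root = ReformulationNode(
    "cheap flights to boston",
    children=[
        ReformulationNode(
            "flights to boston",  # e.g. produced by a deletion
            weight=0.6,
            children=[ReformulationNode("flights boston", weight=0.5)],
        ),
        ReformulationNode(
            "inexpensive flights to boston",  # e.g. produced by a substitution
            weight=0.4,
        ),
    ],
)
paths = list(tree_paths(root))
flat = flatten(root)
```

Each tuple in paths is one sequence of reformulations starting from the original query, and flat gives a weighted list of every query in the tree under the stated product assumption.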

Several implementations of the query distribution model are designed for different types of queries and applications, including short keyword queries, verbose queries, natural language questions and patent applications. Experiments on TREC collections show that the query distribution model significantly and consistently outperforms state-of-the-art techniques.