
Context-Aware Query and Document Representation in Information Retrieval Systems

Abstract
Input representation has a major impact on the effectiveness of Information Retrieval (IR) systems, and developing a context-aware input representation is crucial to answering users' complex information needs. The goal of this work is to take advantage of contextual features in representing the query and the document in order to improve the performance of IR systems. We focus on three sources of contextual features: (1) entities, defined as things or concepts that exist in the world; (2) context within pseudo-relevance feedback documents; and (3) context within example documents provided by the user as the IR system's input.

We first introduce a dense entity representation based on the relationships between an entity and the other entities described in its summary, and explore its use in the entity ranking task by representing both queries and documents with this model. By integrating this ranking methodology with a term-based ranking method, we achieve statistically significant improvements over the term-based approach. We further develop a retrieval model that merges term-based language model retrieval, word-embedding ranking, and entity-embedding ranking, which yields the best performance. Additionally, we introduce an entity-based query expansion framework employing local and global entity knowledge sources, i.e., corpus-based indexed entities and the summary-expanded entity embedding. Our results demonstrate that this entity-based expansion framework outperforms a learned combination of word-based expansion techniques.

We then focus on leveraging the context of pseudo-relevance feedback (PRF) documents to rank terms relevant to the user's query. To achieve this, we utilize transformer models, which excel at capturing context through their attention mechanisms, and expand the query with the top-ranked terms. We propose both unsupervised and supervised frameworks. Our unsupervised model employs transformer-generated embeddings to calculate the similarity between a term from a PRF document and the query, while considering the term's context within the document. This unsupervised approach outperforms static embedding-based expansion models and performs competitively with state-of-the-art word-based feedback models (relevance model variants) across multiple collections. The supervised framework treats query expansion as a binary classification task, aiming to identify terms within the PRF documents that are relevant to the query; we utilize transformer models in a cross-attention architecture to predict relevance scores for candidate terms. This supervised approach yields performance comparable to term-frequency-based feedback models (a relevance model variant), and combining it with the relevance model gives greater improvement than either model used independently.
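As a purely illustrative sketch (not the dissertation's actual implementation), the unsupervised expansion idea can be approximated as follows, assuming a Hugging Face BERT-style encoder with mean pooling; the model name, pooling choice, and token filtering below are all assumptions.

    # Illustrative sketch: score candidate terms from pseudo-relevance feedback
    # (PRF) documents by the cosine similarity of their in-context embeddings
    # to a query embedding, then keep the top-ranked terms for expansion.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embed(text: str) -> torch.Tensor:
        """Mean-pooled contextual embedding of a text span."""
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
        return hidden.mean(dim=1).squeeze(0)

    def term_embeddings_in_context(doc: str) -> dict[str, torch.Tensor]:
        """Embed each token of a PRF document within its document context."""
        inputs = tokenizer(doc, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state.squeeze(0)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        vectors: dict[str, list[torch.Tensor]] = {}
        for tok, vec in zip(tokens, hidden):
            if tok.isalpha():  # skip special tokens, subwords, punctuation
                vectors.setdefault(tok, []).append(vec)
        # Average occurrences of the same term across the document.
        return {t: torch.stack(vs).mean(dim=0) for t, vs in vectors.items()}

    def expansion_terms(query: str, prf_docs: list[str], k: int = 10) -> list[str]:
        """Rank PRF terms by similarity to the query and keep the top k."""
        q_vec = embed(query)
        scores: dict[str, float] = {}
        for doc in prf_docs:
            for term, vec in term_embeddings_in_context(doc).items():
                sim = torch.cosine_similarity(q_vec, vec, dim=0).item()
                scores[term] = max(scores.get(term, float("-inf")), sim)
        return sorted(scores, key=scores.get, reverse=True)[:k]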
Finally, we concentrate on leveraging the context of the example documents provided by the user in the query-by-example retrieval problem to formulate a latent query that represents the user's information need. We construct three query-by-example datasets and develop several transformer-based re-ranking architectures. Our Passage Relevancy Representation by Multiple Examples (PRRIME) model overcomes BERT's context window limitation by segmenting query-example and candidate documents into passages, and trains an end-to-end neural ranking architecture to aggregate passage-level relevance representations, demonstrating improvement over the first-stage ranking framework. Additionally, we explore a cross-encoder re-ranking architecture using the Longformer transformer model for query-by-example retrieval, aiming to capture cross-text relationships, particularly aligning or linking matching information elements across documents. This model shows statistically significant improvement on the test set of the dataset on which it is trained, but does not match the baseline on the other two datasets, which have limited fine-tuning data, indicating limited knowledge transferability. Finally, we investigate a dual-encoder re-ranking architecture that learns query and document representations through an auxiliary training paradigm, using query prediction as an auxiliary task alongside the main ranking objective. It outperforms both the initial retrieval stage and single-loss training, i.e., training the dual encoders solely with a ranking objective.
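The auxiliary-training idea behind the dual-encoder re-ranker can likewise be sketched, again only illustratively: the margin ranking loss, the bag-of-words query-prediction head, and the weighting factor alpha below are assumptions rather than the dissertation's exact design.

    # Illustrative multi-task training step for a dual-encoder re-ranker:
    # the main objective is a ranking loss over (query, positive, negative)
    # embeddings; the auxiliary objective asks the document encoder to help
    # predict the query (here, as a bag-of-words distribution).
    import torch
    import torch.nn as nn

    class DualEncoderWithQueryPrediction(nn.Module):
        def __init__(self, query_encoder: nn.Module, doc_encoder: nn.Module,
                     hidden_dim: int, vocab_size: int):
            super().__init__()
            self.query_encoder = query_encoder   # maps inputs to (batch, hidden_dim)
            self.doc_encoder = doc_encoder       # maps inputs to (batch, hidden_dim)
            # Auxiliary head: predict query terms from the document embedding.
            self.query_head = nn.Linear(hidden_dim, vocab_size)

        def forward(self, query_inputs, pos_doc_inputs, neg_doc_inputs):
            q = self.query_encoder(query_inputs)
            d_pos = self.doc_encoder(pos_doc_inputs)
            d_neg = self.doc_encoder(neg_doc_inputs)
            return q, d_pos, d_neg, self.query_head(d_pos)

    def training_step(model, batch, alpha: float = 0.5):
        q, d_pos, d_neg, query_logits = model(
            batch["query"], batch["pos_doc"], batch["neg_doc"])
        # Main task: rank the relevant document above the non-relevant one.
        rank_loss = nn.functional.margin_ranking_loss(
            nn.functional.cosine_similarity(q, d_pos),
            nn.functional.cosine_similarity(q, d_neg),
            target=torch.ones(q.size(0)), margin=0.2)
        # Auxiliary task: predict the query's terms from the relevant document.
        aux_loss = nn.functional.binary_cross_entropy_with_logits(
            query_logits, batch["query_bow"])  # multi-hot query term labels
        return rank_loss + alpha * aux_loss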
Type
Dissertation (Open Access)
Date
2024-09
License
Attribution 4.0 International
http://creativecommons.org/licenses/by/4.0/