Context-Aware Query and Document Representation in Information Retrieval Systems
Abstract
Input representation has a major impact on the effectiveness of Information Retrieval (IR) systems, and developing context-aware input representations is crucial to answering users' complex information needs. The goal of this work is to take advantage of contextual features to represent the query and the document, and thereby improve the performance of IR systems. We focus on three sources of contextual features: 1. entities, defined as things or concepts that exist in the world; 2. context within pseudo-relevance feedback documents in IR systems; and 3. context within example documents provided by the user as the IR system's input.
We first introduce a dense entity representation based on the relationships between an entity and the other entities described within its summary. We explore its use in the entity ranking task by representing both queries and documents with this model. By integrating this ranking method with a term-based ranking method, we achieve statistically significant improvements over the term-based ranking approach alone. Further, we develop a retrieval model that merges term-based language model retrieval, word-based embedding ranking, and entity-based embedding ranking, which yields the best performance. Additionally, we introduce an entity-based query expansion framework that employs local and global entity knowledge sources, i.e., corpus-based indexed entities and the summary-expanded entity embedding. Our results demonstrate that this entity-based expansion framework outperforms the learned combination of word-based expansion techniques.
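As a rough illustration of merging the three rankers, the sketch below normalizes each run's scores and takes a weighted sum. The abstract does not state how the scores are merged, so the min-max normalization, the linear interpolation, the weights, and the names `min_max` and `combine_runs` are all assumptions made for illustration only.

```python
from typing import Dict, Sequence


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one run's scores to [0, 1] so different rankers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine_runs(
    term: Dict[str, float],
    word: Dict[str, float],
    entity: Dict[str, float],
    weights: Sequence[float] = (0.4, 0.3, 0.3),  # hypothetical weights
) -> Dict[str, float]:
    """Weighted sum of normalized term-, word-, and entity-based scores.

    Linear interpolation is an assumption here, not the dissertation's
    stated method. A document missing from a run contributes 0 for it.
    """
    runs = [min_max(term), min_max(word), min_max(entity)]
    docs = set().union(*runs)
    return {
        doc: sum(w * run.get(doc, 0.0) for w, run in zip(weights, runs))
        for doc in docs
    }


# Toy scores: language-model log-likelihoods and two embedding similarities.
combined = combine_runs(
    term={"d1": -5.0, "d2": -7.0},
    word={"d1": 0.2, "d2": 0.8},
    entity={"d1": 0.1, "d2": 0.9},
)
ranking = sorted(combined, key=combined.get, reverse=True)
```

In practice the weights would be tuned or learned on held-out queries rather than fixed by hand.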
We then focus on leveraging the context of pseudo-relevance feedback (PRF) documents to rank terms by their relevance to the user's query. To achieve this, we use transformer models, which excel at capturing context through their attention mechanisms, and expand the query with the top-ranked terms. We propose both unsupervised and supervised frameworks. Our unsupervised model employs transformer-generated embeddings to calculate the similarity between a term (from a PRF document) and the query, while considering the term's context within the document. Our results demonstrate that this unsupervised approach outperforms static embedding-based expansion models and performs competitively with state-of-the-art word-based feedback models (relevance model variants) across multiple collections. The supervised framework treats query expansion as a binary classification task that aims to identify the terms within the PRF documents that are relevant to the query. We use transformer models in a cross-attention architecture to predict relevance scores for candidate terms. This supervised approach yields performance comparable to term frequency-based feedback models (a relevance model variant). Moreover, combining it with the relevance model yields a greater improvement than either model used independently.
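The unsupervised idea can be sketched as follows, under stated assumptions: each occurrence of a candidate term in a PRF document already has a contextual embedding (toy 2-d vectors stand in for transformer outputs), the query has a single embedding, similarity is cosine, and a term's score is the maximum over its occurrences. The max aggregation and the names `cosine` and `rank_expansion_terms` are hypothetical choices, not details taken from the dissertation.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_expansion_terms(
    query_vec: Sequence[float],
    occurrences: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the k terms whose best contextual occurrence is closest to the query.

    `occurrences` holds (term, contextual embedding) pairs, one per
    occurrence, so the same term can appear with different vectors.
    """
    best: Dict[str, float] = {}
    for term, vec in occurrences:
        sim = cosine(query_vec, vec)
        best[term] = max(best.get(term, -1.0), sim)  # assumed aggregation: max
    return sorted(best, key=best.get, reverse=True)[:k]


# Toy example: "bank" occurs twice in different contexts.
expansion = rank_expansion_terms(
    query_vec=[1.0, 0.0],
    occurrences=[
        ("bank", [1.0, 0.1]),
        ("bank", [0.0, 1.0]),
        ("river", [0.5, 0.5]),
        ("loan", [0.9, 0.2]),
    ],
)
```

Because each occurrence carries its own vector, the same surface term can score differently depending on the passage it appears in, which is what distinguishes this from a static-embedding expansion.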
Finally, we concentrate on leveraging the context of the example documents provided by the user in the query-by-example retrieval problem to formulate a latent query that represents the user's information need. We construct three query-by-example datasets and develop several transformer-based re-ranking architectures. Our Passage Relevancy Representation by Multiple Examples (PRRIME) model overcomes BERT's context window limitations by segmenting the query examples and candidate documents into passages. It then trains an end-to-end neural ranking architecture to aggregate passage-level relevance representations, demonstrating improvement over the first-stage ranking framework. Additionally, we explore a cross-encoder re-ranking architecture that uses the Longformer transformer model for query-by-example retrieval, aiming to capture cross-text relationships, in particular by aligning or linking matching information elements across documents. This model shows statistically significant improvement on the test set of the dataset on which it is trained, but it underperforms the baseline on the other two datasets, which have limited fine-tuning data, indicating limited knowledge transferability. Finally, we investigate a dual-encoder re-ranking architecture that learns query and document representations through an auxiliary training paradigm. It uses query prediction as an auxiliary task alongside the ranking objective as the main task. It outperforms both the initial retrieval stage and the single-loss training method, i.e., training the dual encoders solely with a ranking objective.
Type
Dissertation (Open Access)
Date
2024-09
License
http://creativecommons.org/licenses/by/4.0/