Modeling Cross-Lingual Knowledge in Multilingual Information Retrieval Systems

Abstract
In many search scenarios, language can become a barrier to comprehensively fulfilling users' information needs. An Information Retrieval (IR) system equipped with an additional translation component can map words across languages, enabling it to retrieve documents relevant to the user's query regardless of the languages in which the query and documents are expressed. Effectively incorporating multilingual knowledge is the key to building this translation component. Such knowledge can be obtained from dictionaries, machine translation modules, or multilingual pre-trained language models. For these different forms of multilingual knowledge, we present cross-lingual knowledge injection, transfer, and language debiasing techniques that enhance the effectiveness of Cross-lingual Information Retrieval (CLIR) and Multilingual Information Retrieval (MLIR). Specifically, by utilizing multilingual knowledge at various levels, from individual word translations to parallel and non-parallel corpora, we develop new model architectures and training objectives tailored to information retrieval tasks across diverse linguistic settings. First, we introduce a mixed attention Transformer layer, which injects mutually translated query-document word pairs into the attention matrix, and we investigate its effectiveness on CLIR tasks. Next, we study cross-lingual transfer in IR models and present a knowledge distillation framework that addresses data scarcity in model training and improves retrieval effectiveness for low-resource languages. Then, we focus on a special MLIR setting in which the query is in one language and the collection is a mixture of languages. To address the problem of inconsistent ranking results across languages, we design an encoder-decoder model that maps document representations from different languages into the same embedding space. We also present a decomposable soft prompt that captures unique and shared properties across languages. Finally, we introduce a language debiasing method that identifies and removes linguistic features from a multilingual embedding space. This approach substantially reduces the need for parallel data when constructing MLIR models, allowing non-parallel data to be used instead. By removing language-specific factors from the training process, we improve retrieval effectiveness across all linguistic settings (monolingual, cross-lingual, and multilingual), thereby facilitating language-agnostic information retrieval.
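
The techniques summarized above can be made concrete with short sketches. First, a minimal sketch of the mixed-attention idea in PyTorch: an additive bias is applied to the attention scores of query-document token pairs that a bilingual dictionary marks as mutual translations. The function name, the binary pair mask, and the fixed bias weight are illustrative assumptions, not the dissertation's exact formulation.

    import torch
    import torch.nn.functional as F

    def mixed_attention_scores(q, k, pair_mask, bias_weight=1.0):
        # q: (n_query_tokens, d) query-side token representations
        # k: (n_doc_tokens, d) document-side token representations
        # pair_mask: (n_query_tokens, n_doc_tokens) binary matrix, 1 where a
        # query token and a document token are mutual translations according
        # to an external bilingual dictionary (an assumed resource).
        d = q.size(-1)
        scores = q @ k.transpose(0, 1) / d ** 0.5   # standard scaled dot-product
        scores = scores + bias_weight * pair_mask   # boost dictionary-aligned pairs
        return F.softmax(scores, dim=-1)

    # Toy usage: 3 query tokens, 4 document tokens, one dictionary match.
    q = torch.randn(3, 64)
    k = torch.randn(4, 64)
    pair_mask = torch.zeros(3, 4)
    pair_mask[0, 2] = 1.0   # query token 0 translates to document token 2
    attention = mixed_attention_scores(q, k, pair_mask)

Similarly, one simple way to remove a language-identifying component from a multilingual embedding space is to subtract each language's centroid from its embeddings. This mean-centering sketch is a hedged stand-in for the dissertation's debiasing method, whose exact construction the abstract does not specify.

    def remove_language_means(embeddings, lang_ids):
        # embeddings: (n, d) tensor of document embeddings from a
        # multilingual encoder; lang_ids: length-n list of language labels.
        debiased = embeddings.clone()
        for lang in set(lang_ids):
            idx = [i for i, l in enumerate(lang_ids) if l == lang]
            debiased[idx] -= embeddings[idx].mean(dim=0)
        return debiased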
Type
Dissertation (Open Access)
Date
2024-09
License
Attribution 4.0 International
http://creativecommons.org/licenses/by/4.0/