Author ORCID Identifier


Open Access Dissertation

Document Type


Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded


Month Degree Awarded


First Advisor

James Allan

Second Advisor

W. Bruce Croft

Third Advisor

Ramesh Sitaraman

Fourth Advisor

Evangelos Kanoulas

Subject Categories

Artificial Intelligence and Robotics | Databases and Information Systems


There are significant efforts toward developing better neural approaches for information retrieval problems. However, the vast majority of these studies are conducted using English-only data. Meanwhile, trends and statistics show exponential growth in non-English content and users on the Internet, which suggests that novel information retrieval systems need to be language-agnostic: they need to bridge the language barrier between users and content, leverage data from high-resource settings for lower-resourced settings, and extend easily to new languages and local markets. To this end, we focus on search and recommendation as two vital components of information systems. We explore some of the complex cross-lingual issues to help develop an understanding of the challenges that someone designing a neural Cross-Lingual Information Retrieval (CLIR) system will need to address.

We first introduce a contrastive analysis framework, named Resource Scarcity Simulation (RSS), for simulating low-resource settings using higher-resourced ones. We start with a true low-resource language and systematically down-sample a high-resource language's data so that it becomes an artificial low-resource language that is statistically similar to the true one. Because obtaining extra resources in low-resource settings is extremely expensive, our simulation framework lets one study different possible solutions in the artificially created low-resource setting and extend the findings to the real low-resource problem. We focus on parallel translation corpora and aim to better understand the factors impacting the performance of CLIR systems.
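The down-sampling idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that "statistically similar" means matching the source-side token count of the true low-resource parallel corpus; the abstract does not state which statistics RSS actually matches, and the function names `source_token_count` and `downsample_parallel_corpus` are hypothetical.

```python
import random


def source_token_count(pairs):
    """Count whitespace-separated tokens on the source side of a parallel corpus."""
    return sum(len(src.split()) for src, _ in pairs)


def downsample_parallel_corpus(high_pairs, target_tokens, seed=0):
    """Randomly pick sentence pairs from a high-resource corpus without
    exceeding a source-side token budget.

    ASSUMPTION: the token budget stands in for whatever statistics the
    real framework matches; this is not the dissertation's procedure.
    """
    rng = random.Random(seed)
    shuffled = list(high_pairs)
    rng.shuffle(shuffled)

    sample, tokens = [], 0
    for src, tgt in shuffled:
        n = len(src.split())
        if tokens + n > target_tokens:
            continue  # skip pairs that would overshoot the budget
        sample.append((src, tgt))
        tokens += n
    return sample


# Toy data: the budget comes from the (true) low-resource corpus.
high = [("a b c", "x y z"), ("d e", "u v"), ("f", "w"), ("g h i j", "p q r s")]
low = [("k l", "m n"), ("o", "p")]
budget = source_token_count(low)
sample = downsample_parallel_corpus(high, budget, seed=1)
```

Candidate solutions could then be compared on `sample` as a stand-in for the real low-resource corpus, which is the kind of experiment the framework is meant to enable.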

We then focus on bridging the language gap in neural CLIR approaches. We show that these models perform sub-optimally because typical Cross-Lingual Embeddings (CLE) "translate" query terms into related terms (i.e., terms that appear in a similar context) rather than synonyms in the target language. We introduce Smart Shuffling CLE, which focuses on distinguishing synonyms from related terms during embedding training by using a dictionary to guide the re-ordering of tokens in a pair of sentences that are translations of each other. We further show that our CLE method significantly boosts the performance of an off-the-shelf neural re-ranking model as well as a simple word-by-word query translation CLIR system. We follow up on this work by injecting the dictionary knowledge into the self-attention part of a pre-trained BERT-based ranking model and show a significant improvement in retrieval performance.
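As a rough illustration of dictionary-guided re-ordering, the sketch below merges two parallel sentences into one mixed-language token sequence in which each source token is immediately followed by its dictionary translation, when one occurs in the target sentence. This is an assumption-laden toy, not the Smart Shuffling algorithm itself: the function name `dictionary_guided_merge`, the greedy matching, and the choice to append unmatched target tokens at the end are all hypothetical.

```python
def dictionary_guided_merge(src_tokens, tgt_tokens, dictionary):
    """Interleave two parallel sentences so that dictionary translations
    end up adjacent to each other.

    ASSUMPTION: `dictionary` maps a source token to a list of candidate
    target translations; matching is greedy and first-come, first-served.
    """
    remaining = list(tgt_tokens)
    merged = []
    for tok in src_tokens:
        merged.append(tok)
        for cand in dictionary.get(tok, ()):
            if cand in remaining:
                remaining.remove(cand)  # consume one occurrence
                merged.append(cand)     # place the translation next to its source
                break
    # Target tokens without a dictionary match go to the end (a simplification).
    merged.extend(remaining)
    return merged


src = ["the", "cat", "sleeps"]
tgt = ["le", "chat", "dort"]
bilingual_dict = {"cat": ["chat"], "sleeps": ["dort"]}
merged = dictionary_guided_merge(src, tgt, bilingual_dict)
```

The intuition is that a context-window embedding model trained on such merged sequences would see true translations as close neighbours, whereas a random shuffle would mostly pair a word with merely related terms.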

Finally, we go beyond CLIR and study language-agnostic search and recommendation in the e-commerce domain. Due to a lack of experimental data in this area, we first collect and release XMarket, a large dataset covering 18 local e-commerce markets in 11 different languages. We focus on the market adaptation problem and, using XMarket, first study the problem of recommending relevant products to users in relatively resource-scarce markets by leveraging data from similar, resource-richer auxiliary markets. We then extend our findings toward a universal language-agnostic recommendation system by utilizing multilingual content from multiple markets. Lastly, we construct a product search benchmark using our XMarket dataset and study language-agnostic product search performance across markets for single- and cross-market training scenarios. Our experiments suggest that training universal language-agnostic retrieval systems is challenging and that training a model with data from multiple markets does not always help overall performance. Our proposed language-agnostic universal recommendation model, named FOREC-XCB, demonstrates robust effectiveness by leveraging data from multiple markets and languages and improves performance for each target market when compared to strong baselines.


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Available for download on Friday, September 01, 2023