The diagrams show the distribution of the embeddings, with German data in red and English data in blue. On the left, language strongly influences where embeddings end up: the colors do not overlap at all, which indicates a language bias. The diagram on the right shows blue and red points well interleaved, meaning English and German embeddings map into the same semantic space. Keep in mind that this visualization is only an illustrative starting point and does not let us quantify the degree of alignment.
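If you want to produce such a plot for your own model, a minimal sketch could look like the following. It assumes the sentence-transformers and scikit-learn packages; the model name and the example sentences are placeholders, and any dimensionality reduction (PCA, t-SNE, UMAP) is good enough for a rough picture.

```python
# Illustrative sketch (assumed setup): embed parallel German/English sentences
# with a multilingual model and project them to 2D to eyeball language bias.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

# Placeholder model and toy data -- replace with your own model and corpus.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
english = ["The central bank raised interest rates.", "The match ended in a draw."]
german = ["Die Zentralbank hat die Zinsen erhöht.", "Das Spiel endete unentschieden."]

embeddings = model.encode(english + german)
points = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(points[: len(english), 0], points[: len(english), 1], c="blue", label="English")
plt.scatter(points[len(english):, 0], points[len(english):, 1], c="red", label="German")
plt.legend()
plt.show()
```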
How to measure it?
To quantify and compare the language bias of different models, you can follow one of these approaches:
1. Evaluate your model on the LAReQA dataset as described in the paper LAReQA: Language-agnostic answer retrieval from a multilingual pool by Roy et al. [1]. Besides a method for quantifying the bias, the paper proposes a useful heatmap visualization that shows the degree of alignment for every language pair. On the downside, the LAReQA dataset covers only a limited number of languages.
2. Use the approach from the paper Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation by Nils Reimers and Iryna Gurevych [2]. They propose evaluating the model on a semantic textual similarity (STS) task with a multilingual STS dataset and measuring how much worse it performs when you do not test each language individually but against a candidate set containing all languages at once. A rough sketch of this pooled-candidate comparison follows below.
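The snippet below is a minimal sketch of the general idea behind both measurements, not the exact protocol of either paper: retrieve the matching passage for each query once from a candidate pool restricted to the query's language and once from a pool mixing all languages, then compare the accuracy. The model name and the toy data are placeholders.

```python
# Sketch (not the papers' exact protocols): compare retrieval accuracy with a
# per-language candidate pool vs. a pool that mixes all languages.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # placeholder model

# Hypothetical evaluation data: (query, relevant_passage, language) triples.
samples = [
    ("Who won the election?", "The incumbent won the election by a narrow margin.", "en"),
    ("Wer hat die Wahl gewonnen?", "Der Amtsinhaber gewann die Wahl knapp.", "de"),
    # ... more languages and pairs
]

queries = model.encode([q for q, _, _ in samples], normalize_embeddings=True)
passages = model.encode([p for _, p, _ in samples], normalize_embeddings=True)
langs = np.array([lang for _, _, lang in samples])

def accuracy(candidate_mask):
    hits = 0
    for i, q in enumerate(queries):
        idx = np.where(candidate_mask(i))[0]
        scores = passages[idx] @ q
        hits += int(idx[scores.argmax()] == i)
    return hits / len(queries)

mono = accuracy(lambda i: langs == langs[i])           # candidates in the query's language only
mixed = accuracy(lambda i: np.ones(len(langs), bool))  # candidates from all languages
print(f"accuracy mono-pool: {mono:.2f}, mixed-pool: {mixed:.2f}")
# A large drop from mono to mixed indicates language bias / weak alignment.
```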
In addition to quantifiable measures, it is also advisable to test for language bias manually. Pick a couple of queries, translate them into different languages and issue a search for each translation. You should get approximately the same results for every translation of the same query. If the results instead follow the language of the query, your model is biased and lacks strong alignment.
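A quick way to make this manual check slightly more systematic is to compare the top-k result sets for two translations of the same query, for example via their overlap. Again a rough sketch with placeholder model, corpus and queries:

```python
# Manual bias check (sketch, placeholder data): the same query in two languages
# should retrieve roughly the same documents from a shared corpus.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # placeholder model

corpus = [  # hypothetical mixed-language corpus
    "The central bank raised interest rates by 25 basis points.",
    "Die Zentralbank hat die Zinsen um 25 Basispunkte erhöht.",
    "The football season starts next week.",
    "Die Fußballsaison beginnt nächste Woche.",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)

def top_k(query, k=2):
    q = model.encode(query, normalize_embeddings=True)
    return set(np.argsort(corpus_emb @ q)[::-1][:k])

results_en = top_k("interest rate decision")
results_de = top_k("Zinsentscheidung")

overlap = len(results_en & results_de) / len(results_en | results_de)
print(f"Jaccard overlap of top-k results: {overlap:.2f}")  # low overlap hints at language bias
```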
Mitigating language bias
For the G39 project we built up large fine-tuning datasets from news articles from different countries, so the model could learn the respective news contexts. Thanks to our experience with training monolingual embedding models, we were confident we could create a model that knows the current news context of multiple countries. A central question was whether such a model would be usable for multilingual semantic search. We researched and brainstormed many ideas, but in the end the solution was surprisingly simple: the baseline training approach already produced satisfactory results. That is, we merged our nine large monolingual fine-tuning datasets into a single one consisting of queries and news article chunks in nine different languages (although each query-article pair still used only one language). Training on this dataset with a standard contrastive loss and in-batch negatives eliminated most of the language bias that we had measured and noticed in the base model.
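The training setup itself could look roughly like the snippet below. This is a minimal sketch of contrastive training with in-batch negatives (MultipleNegativesRankingLoss in sentence-transformers), not our exact production code; the base model, the data and the hyperparameters are placeholders.

```python
# Sketch of the merged multilingual fine-tuning run (placeholder model, data
# and hyperparameters): one dataset, query-article pairs in nine languages,
# contrastive loss with in-batch negatives.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # placeholder base model

# Hypothetical merged dataset: each example pairs a query with a relevant
# article chunk; the pair is monolingual, but the dataset mixes all languages.
train_examples = [
    InputExample(texts=["Zinsentscheidung der EZB", "Die EZB hat die Zinsen erhöht ..."]),
    InputExample(texts=["ECB rate decision", "The ECB raised interest rates ..."]),
    # ... many more pairs across nine languages
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
# MultipleNegativesRankingLoss treats the other pairs in the batch as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=1000,
)
```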
Why does this work? We are still in the process of finding out, but we can already name two likely factors:
- having a large amount of high-quality multilingual fine-tuning data at our disposal, and
- having a considerable topic overlap between the different languages in the dataset, because a good part of the news is international.
We nicknamed our approach "brute-force mitigation", since it was obviously the large amount of high-quality training data that did the trick. That being said, if you don't have large amounts of multilingual fine-tuning data at hand, but maybe only data in one language, the approach from Reimers and Gurevych [2] is worth a look. The paper explains how to extend the capabilities of an existing embedding model to new languages with a training objective that explicitly targets equal embeddings across languages (i.e. strong alignment).
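For completeness, here is a rough sketch of that distillation idea, assuming sentence-transformers; the model names and the parallel data are placeholders, and this is not the paper's full training script. A monolingual teacher produces target embeddings, and a multilingual student is trained to reproduce them for both the original sentence and its translation.

```python
# Sketch of multilingual knowledge distillation (Reimers & Gurevych [2]):
# the student learns to map a sentence AND its translation onto the teacher's
# embedding of the original sentence. Models and data are placeholders.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

teacher = SentenceTransformer("all-MiniLM-L6-v2")                       # monolingual teacher (placeholder)
student = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # multilingual student (placeholder)

# Hypothetical parallel corpus of (English sentence, German translation) pairs.
parallel = [
    ("The central bank raised interest rates.", "Die Zentralbank hat die Zinsen erhöht."),
    ("The match ended in a draw.", "Das Spiel endete unentschieden."),
]

train_examples = []
for en, de in parallel:
    target = teacher.encode(en)  # teacher embedding of the source sentence
    # Both the source sentence and its translation should land on the same point.
    train_examples.append(InputExample(texts=[en], label=target))
    train_examples.append(InputExample(texts=[de], label=target))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MSELoss(student)  # mean squared error against the teacher embeddings

student.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```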
References:
[1] LAReQA: Language-agnostic answer retrieval from a multilingual pool https://arxiv.org/abs/2004.05484
[2] Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation https://arxiv.org/pdf/2004.09813