Navigating Linguistic Diversity in Human-Robot Interactions
In the dynamic realm of human-robot interactions, a fundamental challenge lies in effectively addressing the extensive linguistic diversity exhibited by individuals. Many people today are fluent in two or even three languages, including their native tongues and English, while also learning additional languages through formal education. To facilitate seamless engagement with multilingual individuals, a social robot must possess the capability to comprehend and interact in multiple languages, ensuring a truly inclusive and enriching experience.
Let us take the example of Alex and her companion robot, Miko. Alex is proficient in English and is also learning Spanish at school. She mostly speaks to Miko in English but would like to switch to Spanish at times to improve her conversational proficiency. Miko therefore needs to understand and respond in both languages.
In our previous blog post, "Teaching Robots to Converse," we decoded the steps involved in converting a user's query into a robot's response. In this article, we address linguistic diversity in human-robot interactions, exploring the complexities and potential solutions for robots to understand and respond to individuals speaking different languages.
Multilingual Sentence Embeddings
Embedding extraction, which is the conversion of natural language queries into dense vector representations that capture the semantic meaning, forms an important step in automating human-robot interactions. Embeddings encode the contextual and semantic information of the query, allowing machine learning models to process and understand the query intent effectively and formulate an appropriate response. In a multilingual scenario, the embeddings generated need to capture the similarities and relationships between sentences, words and phrases in different languages, enabling cross-lingual understanding and analysis. In other words, sentences and phrases in different languages are represented in the same vector space with the property that words, phrases and sentences with similar meanings, albeit in different languages, must be close together in that vector space.
Let's consider two sentences that mean the same thing in English and Spanish: "How are you doing?" and "¿Cómo estás?". When these are represented in a multilingual vector space using multilingual sentence embeddings, they appear close together (see Figure 1). Similarly, "What is your name?" and its French equivalent, "Comment tu t'appelles ?", appear close together in the multilingual vector space.
Figure 1: Sentences in different languages with similar meanings appear close together in a multilingual embedding space
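To make "close together" concrete, here is a minimal sketch that measures closeness with cosine similarity, one common choice for comparing embeddings. The three-dimensional vectors are hand-picked, hypothetical values for illustration only; real sentence embeddings typically have hundreds of dimensions and come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (not produced by any real model).
toy_embeddings = {
    "How are you doing?": [0.90, 0.10, 0.20],
    "¿Cómo estás?": [0.85, 0.15, 0.25],
    "What is your name?": [0.10, 0.90, 0.30],
    "Comment tu t'appelles ?": [0.15, 0.88, 0.28],
}

# Same meaning, different languages: expected to score close to 1.
same_meaning = cosine_similarity(
    toy_embeddings["How are you doing?"], toy_embeddings["¿Cómo estás?"]
)
# Different meanings: expected to score noticeably lower.
different_meaning = cosine_similarity(
    toy_embeddings["How are you doing?"], toy_embeddings["What is your name?"]
)

print(f"same meaning:      {same_meaning:.3f}")
print(f"different meaning: {different_meaning:.3f}")
```

With these toy values, the English and Spanish greetings score far higher with each other than the greeting does with the question about a name, which is the property a multilingual embedding space is meant to have.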
Creating Multilingual Embeddings
Multilingual embeddings can be generated through the use of large-scale language models and training techniques that leverage vast amounts of multilingual text data. The training data can include web pages, books, articles, and other textual sources available in different languages. Because training such models from scratch is expensive, several studies propose alternatives that train faster and need less data. Knowledge distillation, as applied by Reimers and Gurevych, is one such technique. It is based on the idea that a translated sentence should map to the same location in the vector space as the original sentence. In this approach, a monolingual model (the teacher) generates sentence embeddings for the source language, and a new model (the student) is trained on the source sentences and their translations to reproduce the teacher's embeddings. This allows existing monolingual models to be extended to new languages with relatively few training samples.
Artetxe and Schwenk propose a zero-shot method that generates multilingual embeddings using a single, universal language model trained on a multilingual dataset. In natural language processing (NLP), zero-shot refers to the capability of a model to perform a task in a language for which it was not specifically trained. Typically, NLP models are trained on labeled data in specific languages to learn the patterns and associations needed for tasks like text classification or machine translation. The zero-shot strategy extends this learned knowledge across languages, allowing models to make predictions in languages for which they saw no task-specific training data.
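The distillation objective can be sketched as a mean squared error between the teacher's embedding of a source sentence and the student's embeddings of both that sentence and its translation. The snippet below is an illustrative sketch in plain Python, not the authors' implementation; the function names and the toy two-dimensional vectors are assumptions made for this example.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) / len(u)


def distillation_loss(teacher_src, student_src, student_tgt):
    """Batch average of two terms per sentence pair:
    teacher(source) vs. student(source), and
    teacher(source) vs. student(translation)."""
    total = 0.0
    for t_src, s_src, s_tgt in zip(teacher_src, student_src, student_tgt):
        total += mse(t_src, s_src) + mse(t_src, s_tgt)
    return total / len(teacher_src)


# Hypothetical toy embeddings for a batch of two sentence pairs.
teacher_src = [[0.9, 0.1], [0.2, 0.8]]  # teacher on English sentences
student_src = [[0.8, 0.2], [0.2, 0.7]]  # student on the same English sentences
student_tgt = [[0.5, 0.5], [0.4, 0.6]]  # student on the Spanish translations

print(distillation_loss(teacher_src, student_src, student_tgt))
```

Minimizing this loss pulls the student's embedding of a translation toward the teacher's embedding of the original sentence, which is what places both languages in the same vector space.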
BERT (Bidirectional Encoder Representations from Transformers) is a powerful language representation model from the Transformer architecture family, introduced by Google in 2018. SBERT (Sentence-BERT), built on top of the BERT architecture, generates semantically meaningful representations (embeddings) of sentences in a way that captures their contextual and semantic similarities. SBERT has support for both monolingual and multilingual models.
Language-agnostic BERT Sentence Embedding (LaBSE), yet another extension of BERT, produces sentence embeddings that capture contextual and semantic similarities across 109 languages. LaBSE combines masked language modeling (MLM) and translation language modeling (TLM) during training to learn monolingual and cross-lingual representations. Other notable works include Cohere's multilingual embedding model, which provides language-agnostic representations of text across 100+ languages, and LASER, a library by Meta that computes multilingual sentence embeddings, initially for 9 European languages and for more than 90 languages in later releases.
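As a rough illustration of how such a pretrained model might be used, the sketch below loads LaBSE through the open-source sentence-transformers library. It assumes that library is installed and that the "sentence-transformers/LaBSE" checkpoint can be downloaded; the helper names are our own, and this is not a description of how Miko is implemented.

```python
def embed_sentences(sentences, model_name="sentence-transformers/LaBSE"):
    """Encode sentences into unit-length vectors with a pretrained model."""
    # Imported lazily so the helper below can be used without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # normalize_embeddings=True scales every vector to length 1.
    return model.encode(sentences, normalize_embeddings=True)


def similarity_matrix(embeddings):
    """Pairwise dot products; for unit-length vectors this equals cosine similarity."""
    return [
        [float(sum(x * y for x, y in zip(u, v))) for v in embeddings]
        for u in embeddings
    ]


# Example usage (downloads the model on first run):
# vectors = embed_sentences(["How are you doing?", "¿Cómo estás?", "What is your name?"])
# for row in similarity_matrix(vectors):
#     print([round(score, 2) for score in row])
```

Normalizing the embeddings up front keeps the similarity computation to a simple dot product, which is convenient when many comparisons are needed.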
Summary of Steps
The flowchart below summarizes the steps to generate multilingual embeddings for addressing linguistic diversity in human-robot conversations. The first step is to gather a relevant and comprehensive multilingual text corpus that can serve as training data. After data collection, a preprocessing stage covers essential tasks such as data cleaning, punctuation removal, and language-specific tokenization.
Figure 2: Steps to generate multilingual embeddings
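The preprocessing stage can be sketched with Python's standard library. This is a deliberately simple illustration: the function names are our own, and transformer-based models usually rely on their own subword tokenizers rather than a regular expression like the one used here.

```python
import re
import unicodedata


def clean_text(text):
    """Normalize Unicode and collapse runs of whitespace."""
    # NFC composes characters such as "o" + combining accent into "ó".
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Lowercase the cleaned text and keep runs of word characters,
    which drops punctuation such as "¿" and "?"."""
    return re.findall(r"\w+", clean_text(text).casefold())


print(tokenize("¿Cómo   estás?"))
print(tokenize("Comment tu t'appelles ?"))
```

Note that this naive rule splits the French "t'appelles" at the apostrophe, which is one reason the pipeline calls for language-specific tokenization.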
The next step is to design an embedding model (usually based on the transformer architecture) that captures the contextual and semantic information of the text. This model is trained on the multilingual text corpus, with its parameters optimized by gradient descent using gradients computed through backpropagation. The trained model is then used to extract embeddings for individual words, phrases or sentences that capture semantic information and relationships across languages.
Once language agnostic embedding extraction is done, intent inference through similarity search and response formulation are the next steps before your robot is ready to participate in human conversations.
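As a preview of the similarity-search step, here is a minimal sketch that picks the intent whose reference embedding is closest to the query embedding and falls back when nothing is close enough. The intent names, the toy vectors, and the 0.8 threshold are all hypothetical values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def infer_intent(query_vec, intent_examples, threshold=0.8):
    """Return (intent, score) for the closest reference embedding,
    or ("fallback", score) if the best score is below the threshold."""
    best_intent, best_score = None, -1.0
    for intent, reference_vec in intent_examples.items():
        score = cosine_similarity(query_vec, reference_vec)
        if score > best_score:
            best_intent, best_score = intent, score
    if best_score < threshold:
        return "fallback", best_score
    return best_intent, best_score


# Hypothetical reference embeddings, one per intent.
intent_examples = {
    "greeting": [0.90, 0.10, 0.20],
    "ask_name": [0.10, 0.90, 0.30],
}

# Hypothetical embedding of the Spanish query "¿Cómo estás?".
spanish_query = [0.85, 0.15, 0.25]
print(infer_intent(spanish_query, intent_examples))
```

Because the embeddings are language-agnostic, the reference examples can be written in one language while queries arrive in another.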
Embracing Many Languages: A Way to Make Robots Talk to Everyone
By transcending linguistic barriers through language-agnostic NLP techniques, we enable individuals from diverse language backgrounds to engage with social robots in their native tongues, promoting a sense of belonging and empowerment. Using multilingual sentence embeddings based on the BERT architecture, Miko, the companion robot, can understand and respond to users’ queries regardless of language.
Stay tuned for the next blog where we will delve into intent inference, the next step in decoding human-robot interactions.