All About Retriever Models

Perfecting Query-to-Context Matching for Open-Domain Question-Answering

Have you ever been amazed by your companion bot's knack for having all the answers? In this blog post, we take a deep look at retriever models, the digital counterparts of your friendly librarian, which swiftly pinpoint the right documents or passages to help your bot address your queries. Imagine you're chatting with your bot and you ask, "What's the tallest mountain in the world?" Behind the scenes, your bot quietly transforms your query into a vector and dispatches its reliable retriever model on a quest through a vast repository of knowledge, much like a librarian scouring shelves for the perfect book. These retriever models play a pivotal role in ensuring your bot provides accurate and relevant responses. Let us now take a closer look at what they are and how they work.

Retriever Models: The Basics

In open-domain question answering, the challenge lies in accessing and retrieving relevant information from the vast sea of unstructured data available on the internet and elsewhere. Imagine you're curious about the tallest mountain in the world, Mount Everest. Without knowing where to look, searching through countless web pages and documents for this information is time-consuming and inefficient. This is where retriever models step in. These models act as intelligent filters, swiftly sifting through massive datasets to pinpoint the passages that contain the desired information. Using techniques like text embeddings and approximate nearest neighbor search, retriever models match user queries with pertinent passages, efficiently narrowing down the search space and presenting the most relevant information to the user. For instance, when you ask your virtual assistant, "What is the tallest mountain?", the retriever model can quickly identify and retrieve relevant passages about Mount Everest from a vast knowledge base.

Now, how do retriever models do this? Retriever models work by transforming both the user query and the passages in the knowledge base into numerical representations, often referred to as embeddings. These embeddings capture semantic similarities between words and phrases, allowing the model to compare the query with passages in the knowledge base. In other words, an embedding can be viewed as a special code for a word, sentence, or passage that captures its essence in numerical form. These codes aren't just random numbers, though: they're carefully crafted so that pieces of text with similar meanings have similar codes.

For example, let's take the words "cat" and "dog." In our library, these words might frequently appear together in books about pets. So, their embeddings would be designed to be close together in numerical space, indicating their similarity. Now, consider the words "cat" and "car." While they sound similar, they belong to different semantic categories. Therefore, their embeddings would be further apart in numerical space, reflecting their distinct meanings.
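To make "close together in numerical space" concrete, here is a minimal sketch that compares toy vectors with cosine similarity, one common way of measuring closeness. The three-dimensional vectors are hand-written assumptions for illustration; real embeddings are learned by a model and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean 'very similar'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-written toy vectors (assumed values, not produced by any real model).
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

With these made-up values, "cat" scores much higher against "dog" than against "car", which is the behavior a well-trained embedding model is expected to show for real words.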

How are Embeddings Computed?

Determining the appropriate numerical representation or vector for a given word, sentence, or passage isn't a straightforward process. There's no simple formula or technique to compute these vectors. Instead, these are learned by deep neural network models trained on vast amounts of text data. These models undergo extensive training to encode words or phrases as numerical vectors in a high-dimensional space, capturing semantic similarities between them based on their contexts in the training data.

For instance, consider BERT (Bidirectional Encoder Representations from Transformers), a prominent example of such a model. During its training phase, BERT learns to predict masked words in a sentence given their surrounding context. This process results in embeddings that contain rich semantic information, as they encapsulate the nuanced relationships between words within sentences. In essence, these learned embeddings enable the model to grasp the intricate meanings and connections inherent in natural language.
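The masked-word objective can be sketched as follows. This is a simplified illustration, not BERT's actual preprocessing: it splits on whitespace instead of using subword tokens, and it always substitutes a `[MASK]` token, whereas the published recipe is more elaborate. The function name `make_masked_example` and the `seed` parameter are hypothetical; the default `mask_prob` of 0.15 follows the masking rate reported for BERT.

```python
import random

MASK_TOKEN = "[MASK]"

def make_masked_example(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and return (masked_tokens, labels).

    labels maps each masked position to the original token, i.e. the word
    the model would be trained to predict from the surrounding context.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK_TOKEN
            labels[i] = tok
    if not labels:
        # Short sentences may get no mask by chance; force one training target.
        i = rng.randrange(len(tokens))
        masked[i] = MASK_TOKEN
        labels[i] = tokens[i]
    return masked, labels

sentence = "mount everest is the tallest mountain in the world".split()
masked_tokens, labels = make_masked_example(sentence)

print(masked_tokens)  # the model's input, with some words hidden
print(labels)         # the hidden words the model must recover
```

During training, the model sees only `masked_tokens` and is scored on how well it recovers the entries in `labels`, which pushes it to encode context into its internal representations.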

More about the World of Embeddings

There are several types of embeddings such as word and contextual word embeddings, sentence embeddings, entity embeddings and knowledge graph embeddings. Each type of embedding has its own advantages and is suited to different tasks and applications such as word similarity, document classification, semantic search, and knowledge graph completion. The choice of embedding technique depends on the specific requirements of the task and the characteristics of the data being analyzed.

Like other software, embedding models are available in both closed-source and open-source variants. Closed-source embedding models are proprietary models developed and maintained by companies or organizations. Access to these models may be restricted, and users typically need to obtain licenses or pay subscription fees to use them. Examples of closed-source embedding models include:

  • OpenAI embedding models: OpenAI offers text embedding models that can be used via paid APIs (application programming interfaces). These models allow you to trade off performance against cost.
  • Cohere’s Embed: Cohere also offers text embedding models through paid APIs, with a strong emphasis on accuracy and efficiency.

Open-source embedding models are freely available models that anyone can access, use, and modify, subject to the terms of their licenses. Examples of open-source embedding models include:

  • BAAI Embedding Models: BAAI (Beijing Academy of Artificial Intelligence) publishes the BGE family of open-source text embedding models. These transformer-based encoders generate highly meaningful embeddings for text data and have ranked at or near the top of benchmarks such as the Massive Text Embedding Benchmark (MTEB). They have proven useful in enhancing semantic search engines and the retrieval step of Retrieval Augmented Generation (RAG).

  • Snowflake’s Arctic Family of Embedding Models: Snowflake recently open-sourced its Arctic embedding models under an Apache 2.0 license. These provide advanced retrieval capabilities to organizations integrating proprietary datasets with LLMs for Retrieval Augmented Generation (RAG) or semantic search services. As of April 2024, Snowflake’s models rank first among embedding models of similar size. Their largest model is outperformed only by open-source models with more than 20 times as many parameters (four times for closed-source models) and by closed-source models that do not disclose any model characteristics.

  • Models tuned for MS MARCO Passage Ranking: MS MARCO Passage Ranking is a benchmark dataset and evaluation task for passage retrieval models, where the goal is to rank passages from a large corpus by their relevance to user queries. Models such as BERT (Bidirectional Encoder Representations from Transformers) and T5 (Text-To-Text Transfer Transformer) are often fine-tuned on the MS MARCO dataset to generate passage embeddings and rank passages by their relevance to the query.

  • Jina Embedding Models: Jina, a multimodal AI company, released an open-source sentence embedding model that supports context lengths of up to 8K tokens. It also released bilingual embedding models for German-English and Chinese-English this year.

    Assessing the Effectiveness of Embedding Models: Benchmarks and Metrics

    Evaluating the performance of embedding models is essential for understanding their effectiveness in capturing semantic information and facilitating downstream natural language processing tasks. Benchmarks and metrics play a crucial role in this assessment, providing standardized evaluation settings and quantitative measures of model performance.

    • MTEB (Massive Text Embedding Benchmark): MTEB, a renowned benchmark dataset and evaluation platform, assesses the efficacy of text embeddings across tasks like semantic similarity, document classification, and information retrieval. Developed collaboratively by the research community, MTEB offers a standardized benchmark, facilitating direct comparisons between embedding models across diverse domains and applications.
    • MS MARCO (Microsoft MAchine Reading COmprehension) Passage Ranking: MS MARCO Passage Ranking is a benchmark dataset and evaluation task dedicated to passage retrieval models. It features authentic user queries and human-annotated relevance judgments for web passages. Models are assessed on their capacity to precisely rank passages based on their relevance to the query, offering a standardized evaluation framework for embedding models in information retrieval scenarios.
    • Evaluation Metrics: In addition to benchmark datasets, various evaluation metrics are used to quantify the performance of embedding models. Common metrics include Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and accuracy. MAP is the mean, over all queries, of each query's average precision, where average precision is computed from the precision at the rank of every relevant passage. MRR is the mean, over all queries, of the reciprocal of the rank at which the first relevant passage appears. These metrics provide insights into the overall effectiveness of embedding models in capturing semantic similarities and facilitating downstream tasks.
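As a worked example of the two ranking metrics above, the sketch below computes MRR and MAP for two hypothetical queries. The passage ids and relevance judgments are made up for illustration, and the helper names are not taken from any particular evaluation library.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_ids, relevant_ids):
    """Sum of precision@k at each rank k holding a relevant result,
    divided by the total number of relevant results for the query."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_over_queries(metric, runs):
    """Average a per-query metric over all (ranking, relevant set) pairs."""
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Hypothetical retrieval results: (ranked passage ids, set of relevant ids).
runs = [
    (["p3", "p1", "p7"], {"p1", "p7"}),  # first relevant passage at rank 2
    (["p2", "p9", "p4"], {"p2"}),        # first relevant passage at rank 1
]

mrr = mean_over_queries(reciprocal_rank, runs)
map_score = mean_over_queries(average_precision, runs)

print(f"MRR = {mrr:.3f}")        # (1/2 + 1/1) / 2 = 0.750
print(f"MAP = {map_score:.3f}")  # (7/12 + 1) / 2 = 0.792
```

For the first query, the relevant passages sit at ranks 2 and 3, giving an average precision of (1/2 + 2/3) / 2 = 7/12; the second query is answered perfectly, so both of its scores are 1.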

    In summary, benchmarks such as MTEB and MS MARCO Passage Ranking, along with evaluation metrics like MAP and MRR, are invaluable tools for assessing the effectiveness of embedding models. By providing standardized evaluation settings and quantitative measures of performance, these benchmarks and metrics enable researchers to compare different models, identify areas for improvement, and advance the state-of-the-art.

    Vector Libraries and Databases: Storing and Retrieving Embeddings

    The next crucial question pertains to the storage and retrieval of these embeddings: Where are they stored, and how are they accessed when a query is initiated?

    Embeddings, which are high-dimensional vectors, are stored in specialized libraries and databases called vector libraries/databases. These are optimized for similarity search tasks, where the goal is to find vectors that are most similar to a given query vector. By using advanced indexing structures and algorithms, these libraries and databases can quickly retrieve nearest neighbors to a query vector.

    Vector libraries, including Facebook Faiss, Spotify Annoy, Google ScaNN, NMSLIB, and HNSWLIB, use in-memory indexes to enable similarity search over vector embeddings. These libraries store only the embeddings, not the associated objects or passages, which typically live in secondary storage such as a database. Retrieval therefore takes two steps: find the nearest vectors, then look up the corresponding passages. Indexes in vector libraries are typically immutable once built, and in many cases queries cannot be executed during data import. The libraries rely on approximate nearest neighbor (ANN) algorithms for similarity search, with different approaches across libraries: Faiss uses clustering, Annoy employs tree-based methods, and ScaNN employs vector compression. Each approach offers different performance tradeoffs, allowing users to choose based on application requirements.
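The two-step retrieval and the clustering idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Faiss or any other library is implemented: the `ClusteredIndex` class is hypothetical, the centroids are fixed by hand rather than learned, and the two-dimensional vectors are made up in place of real embeddings.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class ClusteredIndex:
    """Toy in-memory index: each vector is bucketed under its nearest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = [[] for _ in centroids]  # entries: (passage_id, vector)

    def _nearest_centroids(self, vector, n):
        order = sorted(
            range(len(self.centroids)),
            key=lambda c: squared_distance(vector, self.centroids[c]),
        )
        return order[:n]

    def add(self, passage_id, vector):
        bucket = self._nearest_centroids(vector, 1)[0]
        self.buckets[bucket].append((passage_id, vector))

    def search(self, query, k=1, n_probe=1):
        """Scan only the n_probe closest buckets instead of every stored vector.

        This is what makes the search approximate: a true neighbor sitting
        in a bucket that is not probed will be missed.
        """
        candidates = []
        for bucket in self._nearest_centroids(query, n_probe):
            candidates.extend(self.buckets[bucket])
        candidates.sort(key=lambda item: squared_distance(query, item[1]))
        return [passage_id for passage_id, _ in candidates[:k]]

# The passages live outside the index (a dict standing in for a database).
passage_store = {
    "p1": "Mount Everest is the tallest mountain above sea level.",
    "p2": "K2 is the second-highest mountain on Earth.",
    "p3": "The Nile is often cited as the longest river.",
}
# Assumed 2-d vectors; a real system would obtain these from an embedding model.
vectors = {"p1": [0.9, 0.1], "p2": [0.8, 0.2], "p3": [0.1, 0.9]}

index = ClusteredIndex(centroids=[[1.0, 0.0], [0.0, 1.0]])
for passage_id, vector in vectors.items():
    index.add(passage_id, vector)

query_vector = [0.95, 0.05]                          # stand-in for an embedded query
top_ids = index.search(query_vector, k=2)            # step 1: the index returns ids
top_passages = [passage_store[i] for i in top_ids]   # step 2: fetch the passage text

print(top_ids)
print(top_passages)
```

Raising `n_probe` scans more buckets, which improves recall at the cost of speed; that is the kind of tradeoff these libraries expose in their own ways.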

    Vector databases, on the other hand, offer comprehensive CRUD (Create, Read, Update, Delete) support, addressing the limitations of vector libraries. Designed for enterprise-level production deployments, databases provide a more holistic solution by accommodating both data objects and vectors. This integration enables combining vector search with structured filters, ensuring that nearest neighbors align with metadata filters. Unlike vector libraries, databases permit querying and modifying data during the import process. As millions of objects are uploaded, the data remains fully accessible and operational, eliminating the need to wait for import completion before accessing existing data.
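To illustrate how storing objects, vectors, and metadata together enables filtered search and CRUD operations, here is a minimal sketch. The `TinyVectorStore` class, its method names, and the sample records are all hypothetical, and the search is a brute-force scan for clarity; production databases rely on ANN indexes and more sophisticated filtering strategies.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class TinyVectorStore:
    """Toy stand-in for a vector database: text, vector and metadata per record."""

    def __init__(self):
        self._records = {}  # id -> {"vector": [...], "text": str, "meta": dict}

    def upsert(self, record_id, vector, text, meta):
        """Create a record, or overwrite it if the id already exists."""
        self._records[record_id] = {"vector": vector, "text": text, "meta": meta}

    def get(self, record_id):
        return self._records.get(record_id)

    def delete(self, record_id):
        self._records.pop(record_id, None)

    def search(self, query, k=3, where=None):
        """Nearest neighbors among records whose metadata matches `where`."""
        where = where or {}
        matches = [
            (rid, rec)
            for rid, rec in self._records.items()
            if all(rec["meta"].get(key) == value for key, value in where.items())
        ]
        matches.sort(key=lambda item: squared_distance(query, item[1]["vector"]))
        return [(rid, rec["text"]) for rid, rec in matches[:k]]

store = TinyVectorStore()
store.upsert("p1", [0.9, 0.1],
             "Mount Everest is the tallest mountain above sea level.",
             {"topic": "mountains"})
store.upsert("p2", [0.85, 0.15],
             "Notes on choosing mountain-bike tyres.",
             {"topic": "cycling"})
store.upsert("p3", [0.1, 0.9],
             "The Nile is often cited as the longest river.",
             {"topic": "rivers"})

query_vector = [0.86, 0.14]  # stand-in for an embedded query
unfiltered = store.search(query_vector, k=1)
filtered = store.search(query_vector, k=1, where={"topic": "mountains"})

print(unfiltered)  # nearest vector overall
print(filtered)    # nearest vector that also satisfies the metadata filter

store.delete("p2")  # records can be removed without rebuilding anything
```

In this made-up data the closest vector overall belongs to the cycling note, so the metadata filter is what steers the result to the passage about Mount Everest.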

    Feature Comparison: Vector Libraries vs Databases (Image Source)

    Both commercial and open-source vector databases are available to cater to various needs and preferences. The image below showcases some of the most widely used vector databases.

    The Landscape of Vector Databases (Image Source)

    This brings us to the end of this blog post on retriever models and the role they play in open-domain QA. Given a query, these models transform it into an embedding and identify the most relevant context vectors in the vector database. In the next post, we will look at how to improve the precision and relevance of QA using techniques such as re-ranking and hybrid search on top of the output of the retriever models. Stay tuned!
