Building a Retrieval-Augmented Generation (RAG) System with FAISS and Open-Source LLMs


Retrieval-augmented generation (RAG) has emerged as a powerful paradigm for enhancing the capabilities of large language models (LLMs). By combining the creative generation abilities of LLMs with the factual accuracy of retrieval systems, RAG offers a solution to one of LLMs’ most persistent challenges: hallucination.

In this tutorial, we’ll build a complete RAG system using:

  • FAISS (Facebook AI Similarity Search) as our vector database
  • Sentence Transformers for creating high-quality embeddings
  • An open-source LLM from Hugging Face (we’ll use a lightweight model suitable for CPU)
  • A custom knowledge base that we’ll create

By the end of this tutorial, you’ll have a functioning RAG system that can answer questions based on your documents with improved accuracy and relevance. This approach is valuable for building domain-specific assistants, customer support systems, or any application where grounding LLM responses in specific documents is important.

Let’s get started.

Step 1: Setting Up Our Environment

First, we need to install all the required libraries. For this tutorial, we’ll use Google Colab.

# Install required packages
!pip install -q transformers==4.34.0
!pip install -q sentence-transformers==2.2.2
!pip install -q faiss-cpu==1.7.4
!pip install -q accelerate==0.23.0
!pip install -q einops==0.7.0
!pip install -q langchain==0.0.312
!pip install -q langchain_community
!pip install -q pypdf==3.15.1

Let’s also check whether we have access to a GPU, which will speed up model inference:

import torch


# Check if a GPU is available
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
else:
    print("Running on CPU. We'll use a CPU-compatible model.")

Step 2: Creating Our Knowledge Base

For this tutorial, we’ll create a simple knowledge base about AI concepts. In a real-world scenario, you would instead import PDF documents, web pages, or databases.

import os
import tempfile


# Create a temporary directory for our documents
docs_dir = tempfile.mkdtemp()
print(f"Created temporary directory at {docs_dir}")


# Create sample documents about AI concepts
documents = {
    "vector_databases.txt": """
    Vector databases are specialized database systems designed to store, manage, and search vector embeddings efficiently.
    They are crucial for machine learning applications, particularly those involving natural language processing and image recognition.

    Key features of vector databases include:
    1. Fast similarity search using algorithms like HNSW, IVF, or exact search
    2. Support for various distance metrics (cosine, euclidean, dot product)
    3. Scalability for handling billions of vectors
    4. Often, support for metadata filtering alongside vector search

    Popular vector databases include FAISS (Facebook AI Similarity Search), Pinecone, Weaviate, Milvus, and Chroma.
    FAISS specifically was developed by Facebook AI Research and is an open-source library for efficient similarity search.
    """,

    "embeddings.txt": """
    Embeddings are dense vector representations of data in a continuous vector space.
    They capture semantic meaning and relationships between entities by positioning similar items closer together in the vector space.

    Types of embeddings include:
    1. Word embeddings (Word2Vec, GloVe)
    2. Sentence embeddings (Universal Sentence Encoder, SBERT)
    3. Document embeddings
    4. Image embeddings
    5. Audio embeddings

    Embeddings are created through various techniques, including neural networks trained on specific tasks.
    Modern embedding models like those from OpenAI, Cohere, or Sentence Transformers can capture nuanced semantic relationships.

    The dimensionality of embeddings typically ranges from 100 to 1536 dimensions, with higher dimensions generally capturing more information but requiring more storage and computation.
    """,

    "rag_systems.txt": """
    Retrieval-Augmented Generation (RAG) is an AI architecture that combines information retrieval with text generation.

    The RAG process typically works as follows:
    1. The user query is converted into an embedding vector
    2. Relevant documents or passages are retrieved from a knowledge base using vector similarity
    3. Retrieved content is provided as context to the language model
    4. The language model generates a response informed by both its parameters and the retrieved information

    Benefits of RAG include:
    1. Reduced hallucination compared to purely generative approaches
    2. Up-to-date information without model retraining
    3. Attribution of information sources
    4. Lower computation costs than increasing model size

    RAG systems can be enhanced through techniques like reranking, query reformulation, and hybrid search approaches.
    """
}


# Write documents to files
for filename, content in documents.items():
    with open(os.path.join(docs_dir, filename), 'w') as f:
        f.write(content)

print(f"Created {len(documents)} documents in {docs_dir}")

Step 3: Loading and Processing Documents

Now, let’s load these documents and process them for our RAG system:

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


# Initialize a list to store our documents
all_documents = []

# Load each text file
for filename in documents.keys():
    file_path = os.path.join(docs_dir, filename)
    loader = TextLoader(file_path)
    loaded_docs = loader.load()
    all_documents.extend(loaded_docs)

print(f"Loaded {len(all_documents)} documents")

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)

document_chunks = text_splitter.split_documents(all_documents)
print(f"Created {len(document_chunks)} document chunks")

# Let's look at a sample chunk
print("\nSample chunk content:")
print(document_chunks[0].page_content)
print(f"Source: {document_chunks[0].metadata}")

Step 4: Creating Embeddings

Now, let’s convert our document chunks into vector embeddings:

from sentence_transformers import SentenceTransformer
import numpy as np


# Initialize the embedding model
model_name = "sentence-transformers/all-MiniLM-L6-v2"  # A good balance of speed and quality
embedding_model = SentenceTransformer(model_name)

print(f"Loaded embedding model: {model_name}")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

# Create embeddings for all document chunks
texts = [doc.page_content for doc in document_chunks]
embeddings = embedding_model.encode(texts)

print(f"Created {len(embeddings)} embeddings with shape {embeddings.shape}")

Step 5: Building the FAISS Index

Now we’ll build our FAISS index with these embeddings:

import faiss


# Get the dimensionality of our embeddings
dimension = embeddings.form[1]


# Create a FAISS index - we'll use a simple flat L2 index for demonstration
# For larger datasets, consider indexes like IVF or HNSW for better performance
index = faiss.IndexFlatL2(dimension)  # L2 is Euclidean distance


# Add our vectors to the index
index.add(embeddings.astype(np.float32))  # FAISS requires float32


print(f"Created FAISS index with {index.ntotal} vectors")


# Create a mapping from index position to document chunk for retrieval
index_to_doc_chunk = {i: doc for i, doc in enumerate(document_chunks)}
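
For a knowledge base of this size, the flat index is ideal because it compares the query against every vector exactly. As a hedged sketch of the IVF alternative mentioned in the comment above (the nlist and nprobe values are illustrative, and IVF search is approximate, so results can differ slightly from the flat index):

# Sketch: an IVF index for larger collections (approximate nearest-neighbor search)
nlist = 4  # number of clusters; illustrative value -- training needs at least nlist vectors
quantizer = faiss.IndexFlatL2(dimension)
ivf_index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_L2)

# IVF indexes must be trained on (a sample of) the vectors before adding them
vectors = embeddings.astype(np.float32)
ivf_index.train(vectors)
ivf_index.add(vectors)
ivf_index.nprobe = 2  # clusters scanned per query; higher = slower but more accurate

# Any FAISS index can also be saved to disk and reloaded later
faiss.write_index(ivf_index, "ivf_index.faiss")
print(f"IVF index contains {ivf_index.ntotal} vectors")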

Step 6: Loading a Language Model

Now let’s load an open-source language model from Hugging Face. We’ll use a smaller model that works well on CPU:

from transformers import AutoTokenizer, AutoModelForCausalLM


# We'll use a smaller model that works on CPU
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # Use float32 for CPU compatibility
    device_map="auto"  # Will use the CPU if no GPU is available
)

print(f"Successfully loaded {model_id}")

Step 7: Creating Our RAG Pipeline

Let’s create a function that combines retrieval and generation:

def rag_response(query, index, embedding_model, llm_model, llm_tokenizer, index_to_doc_map, top_k=3):
    """
    Generate a response using the RAG pattern.

    Args:
        query: The user's question
        index: FAISS index
        embedding_model: Model to create embeddings
        llm_model: Language model for generation
        llm_tokenizer: Tokenizer for the language model
        index_to_doc_map: Mapping from index positions to document chunks
        top_k: Number of documents to retrieve

    Returns:
        response: The generated response
        sources: The source documents used
    """
    # Step 1: Convert the query to an embedding
    query_embedding = embedding_model.encode([query])
    query_embedding = query_embedding.astype(np.float32)  # Convert to float32 for FAISS

    # Step 2: Search for similar documents
    distances, indices = index.search(query_embedding, top_k)

    # Step 3: Retrieve the actual document chunks
    retrieved_docs = [index_to_doc_map[idx] for idx in indices[0]]

    # Create context from the retrieved documents
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])

    # Step 4: Create the prompt for the LLM (TinyLlama chat format)
    prompt = f"""<|system|>
You are a helpful AI assistant. Answer the question based only on the provided context.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."

Context:
{context}
<|user|>
{query}
<|assistant|>"""

    # Step 5: Generate a response from the LLM
    input_ids = llm_tokenizer(prompt, return_tensors="pt").input_ids.to(llm_model.device)

    generation_config = {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.95,
        "do_sample": True
    }

    # Generate the output
    with torch.no_grad():
        output = llm_model.generate(
            input_ids=input_ids,
            **generation_config
        )

    # Decode the output
    generated_text = llm_tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract the assistant's response (remove the prompt)
    response = generated_text.split("<|assistant|>")[-1].strip()

    # Return both the response and the sources
    sources = [(doc.page_content, doc.metadata) for doc in retrieved_docs]

    return response, sources

Step 8: Testing Our RAG System

Let’s test our system with some questions:

# Define some test questions
test_questions = [
    "What is FAISS and what is it used for?",
    "How do embeddings capture semantic meaning?",
    "What are the benefits of RAG systems?",
    "How does vector search work?"
]

# Test our RAG pipeline
for question in test_questions:
    print(f"\n\n{'='*50}")
    print(f"Question: {question}")
    print(f"{'='*50}\n")

    response, sources = rag_response(
        query=question,
        index=index,
        embedding_model=embedding_model,
        llm_model=model,
        llm_tokenizer=tokenizer,
        index_to_doc_map=index_to_doc_chunk,
        top_k=2  # Retrieve the top 2 most relevant chunks
    )

    print(f"Response: {response}\n")

    print("Sources:")
    for i, (content, metadata) in enumerate(sources):
        print(f"\nSource {i+1}:")
        print(f"Metadata: {metadata}")
        print(f"Content snippet: {content[:100]}...")


Step 9: Evaluating Our RAG System

Let’s implement a simple evaluation function to assess the performance of our RAG system:

def evaluate_rag_response(question, response, retrieved_sources, ground_truth_sources=None):
    """
    Simple evaluation of RAG response quality

    Args:
        question: The user's question
        response: Generated response
        retrieved_sources: Sources used for generation
        ground_truth_sources: (Optional) Known correct sources

    Returns:
        Dictionary of evaluation metrics
    """
    # Basic metrics
    response_length = len(response.split())
    num_sources = len(retrieved_sources)

    # Simple relevance score - use more robust methods in production
    source_relevance = []
    for content, _ in retrieved_sources:
        # Count overlapping words between the question and the source
        q_words = set(question.lower().split())
        s_words = set(content.lower().split())
        overlap = len(q_words.intersection(s_words))
        source_relevance.append(overlap / len(q_words) if q_words else 0)

    avg_relevance = sum(source_relevance) / len(source_relevance) if source_relevance else 0

    return {
        "response_length": response_length,
        "num_sources": num_sources,
        "source_relevance_scores": source_relevance,
        "avg_relevance": avg_relevance
    }


# Evaluate one of our earlier questions
question = test_questions[0]
response, sources = rag_response(
    query=question,
    index=index,
    embedding_model=embedding_model,
    llm_model=model,
    llm_tokenizer=tokenizer,
    index_to_doc_map=index_to_doc_chunk,
    top_k=2
)

# Run the evaluation
eval_results = evaluate_rag_response(question, response, sources)
print(f"\nEvaluation results for question: '{question}'")
for metric, value in eval_results.items():
    print(f"{metric}: {value}")

Step 10: Advanced RAG Techniques – Query Expansion

Let’s implement query expansion to improve retrieval:

# Here is the implementation of the expand_query function:

def expand_query(original_query, llm_model, llm_tokenizer):
    """
    Generate multiple search queries from an original query to improve retrieval

    Args:
        original_query: The user's original question
        llm_model: The language model for generating variations
        llm_tokenizer: Tokenizer for the language model

    Returns:
        List of query variations, including the original
    """
    # Create a prompt for query expansion
    prompt = f"""<|system|>
You are a helpful assistant. Generate two alternative versions of the given search query.
The goal is to create variations that might help retrieve relevant information.
Only list the alternative queries, one per line. Do not include any explanations, numbering, or other text.
<|user|>
Generate alternative versions of this search query: "{original_query}"
<|assistant|>"""

    # Generate variations
    input_ids = llm_tokenizer(prompt, return_tensors="pt").input_ids.to(llm_model.device)

    with torch.no_grad():
        output = llm_model.generate(
            input_ids=input_ids,
            max_new_tokens=100,
            temperature=0.7,
            do_sample=True
        )

    # Decode the output
    generated_text = llm_tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract the generated variations
    response_part = generated_text.split("<|assistant|>")[-1].strip()

    # Split the response into lines to get individual variations
    variations = [line.strip() for line in response_part.split('\n') if line.strip()]

    # Ensure we have at least some variations
    if not variations:
        variations = [original_query]

    # Add the original query and return the list with duplicates removed
    all_queries = [original_query] + variations
    return list(dict.fromkeys(all_queries))  # Remove duplicates while preserving order

Step 11: Testing Our expand_query Function

Let’s try the expand_query function end to end: generate query variations, retrieve documents for each, and produce a final answer.

# Example usage of the expand_query function
test_query = "How does FAISS help with vector search?"

# Generate query variations
expanded_queries = expand_query(
    original_query=test_query,
    llm_model=model,
    llm_tokenizer=tokenizer
)

print(f"Original Query: {test_query}")
print("Expanded Queries:")
for i, query in enumerate(expanded_queries):
    print(f"  {i+1}. {query}")

# Enhanced RAG with query expansion
all_retrieved_docs = []
all_scores = {}

# Retrieve documents for each query variation
for query in expanded_queries:
    # Get the query embedding
    query_embedding = embedding_model.encode([query]).astype(np.float32)

    # Search the FAISS index
    distances, indices = index.search(query_embedding, 3)

    # Track document scores across queries (using 1/(1+distance) as the score)
    for idx, dist in zip(indices[0], distances[0]):
        score = 1.0 / (1.0 + dist)
        if idx in all_scores:
            # Take the max score if a document is retrieved by multiple query variations
            all_scores[idx] = max(all_scores[idx], score)
        else:
            all_scores[idx] = score

# Get the top documents based on their scores
top_indices = sorted(all_scores.keys(), key=lambda idx: all_scores[idx], reverse=True)[:3]
expanded_retrieved_docs = [index_to_doc_chunk[idx] for idx in top_indices]

print("\nRetrieved documents using query expansion:")
for i, doc in enumerate(expanded_retrieved_docs):
    print(f"\nResult {i+1}:")
    print(f"Source: {doc.metadata['source']}")
    print(f"Content snippet: {doc.page_content[:150]}...")

# Now use these documents with the LLM to generate a response
context = "\n\n".join([doc.page_content for doc in expanded_retrieved_docs])

# Create the prompt for the LLM
prompt = f"""<|system|>
You are a helpful AI assistant. Answer the question based only on the provided context.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."

Context:
{context}
<|user|>
{test_query}
<|assistant|>"""

# Generate the response
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.95,
        do_sample=True
    )

# Extract the response
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
response = generated_text.split("<|assistant|>")[-1].strip()

print("\nFinal RAG Response with Query Expansion:")
print(response)

Output:

FAISS can handle a wide range of vector types, including text, image, and audio, and can be integrated with popular machine learning frameworks such as TensorFlow, PyTorch, and Sklearn.

Conclusion

In this tutorial, we’ve built a complete RAG system using FAISS as our vector database and an open-source LLM. We implemented document loading and chunking, embedding generation, and vector indexing, then added query expansion and a simple evaluation step to improve and assess retrieval quality.

From here, you could consider:

  • Implementing query reranking with cross-encoders (a brief sketch follows this list)
  • Creating a web interface using Gradio or Streamlit
  • Adding metadata filtering capabilities
  • Experimenting with different embedding models
  • Scaling the solution with more efficient FAISS indexes (HNSW, IVF)
  • Fine-tuning the LLM on your domain-specific data
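
As a hedged sketch of the first idea, a cross-encoder can rescore the chunks FAISS returns and keep only the best ones. The checkpoint name below is a commonly used public model, and rerank_chunks is a helper introduced here purely for illustration:

from sentence_transformers import CrossEncoder

# Cross-encoder checkpoint (assumed to be downloadable in your environment)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_chunks(query, chunks, top_k=2):
    """Illustrative helper: rescore retrieved chunks with a cross-encoder."""
    pairs = [(query, chunk.page_content) for chunk in chunks]
    scores = reranker.predict(pairs)  # one relevance score per (query, passage) pair
    ranked = sorted(zip(chunks, scores), key=lambda item: item[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# Example: over-retrieve 5 candidates with FAISS, then rerank down to the best 2
question = "What is FAISS used for?"
query_vec = embedding_model.encode([question]).astype(np.float32)
_, idxs = index.search(query_vec, 5)
candidates = [index_to_doc_chunk[i] for i in idxs[0]]
best_chunks = rerank_chunks(question, candidates, top_k=2)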
