Practical Tutorial to Retrieval Augmented Generation on Google Colab

In this article:

We will take a look at how to set up a RAG – Retrieval Augmented Generation – demo with the Anthropic Claude 3 Sonnet model, using Google's Colab platform. Colab occasionally offers free instances with T4 GPUs, but a simple CPU instance is all we need here, since we only access the model through its API.

RAG can be used to supply an already trained model with new information to improve its question-answering capabilities. We will load the new data into a vector database that serves as additional, external memory for the model. A retrieval framework – llama-index in our case – queries this database, constructs a task-specific prompt from the fetched documents, and passes both on to the language model.
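In pseudocode, the flow we are about to build looks roughly like this – the function and method names here are purely illustrative, and the concrete llama-index calls come later in the article:

# Illustrative pseudocode of the RAG flow described above – not a real API
def answer_with_rag(question, vector_db, llm):
    # 1. Embed the question and retrieve the most similar document chunks
    context = vector_db.similarity_search(question)
    # 2. Build a task-specific prompt that contains the retrieved context
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    # 3. The language model answers with the external knowledge in its prompt
    return llm.complete(prompt)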

Dependencies

Grab something to enhance the model with – we'll use a paper about QLoRA – Quantized Low-Rank Adaptation – for this example, but this could be any text-based content that was not part of the training data for the particular model. Since this is a PDF, we'll need to use a PDF loader later; make sure to account for that if you want to use some other format.

We can use shell commands in the notebook by prefixing them with an exclamation mark, which makes it simple to download the source PDF:

!wget "https://arxiv.org/pdf/2305.14314.pdf" -O /content/QLORA.pdf

Note: for some reason I don't fully understand yet, I had to open the PDF in my browser before Colab could download it – until I did, the request kept returning 403 errors.
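If you run into the same issue, one possible workaround – assuming the 403 comes from arXiv rejecting requests without a browser-like User-Agent, which I haven't verified – is to download the file from Python with an explicit header:

# Assumed workaround: send a browser-like User-Agent header with the request
import requests

pdf = requests.get(
    "https://arxiv.org/pdf/2305.14314.pdf",
    headers={"User-Agent": "Mozilla/5.0"},
)
pdf.raise_for_status()
with open("/content/QLORA.pdf", "wb") as f:
    f.write(pdf.content)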

There are also some Python dependencies we need to install:

!pip install torch llama-index==0.10.20 transformers accelerate bitsandbytes pypdf chromadb==0.4.24 sentence-transformers pydantic==1.10.11 llama-index-embeddings-huggingface llama-index-llms-huggingface llama-index-readers-file llama-index-vector-stores-chroma llama-index-llms-anthropic --quiet

The model

There are quite a few things we need to import from the packages we just installed. On top of that, we have to register an account with Anthropic to be able to access their models, since they are not open source. They do, however, offer $5 worth of API usage for free, of which we'll need about 3 cents for this demo. Feel free to spend the rest on whatever else you'd like to test with it! You can register an account with Anthropic here.

## app.py

## Import necessary libraries
import torch
import sys
import chromadb
from llama_index.core import VectorStoreIndex, download_loader, ServiceContext, Settings
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.storage.storage_context import StorageContext
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.readers.file import PDFReader
from llama_index.llms.anthropic import Anthropic
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList
from llama_index.core import PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM
from IPython.display import Markdown, display, HTML
from pathlib import Path
import os

After registering and activating the $5 voucher, we need to create an API key and expose it to the code as an environment variable.

os.environ["ANTHROPIC_API_KEY"] = "YOUR API KEY HERE"
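If you'd rather not paste the key directly into the notebook, Colab's built-in Secrets panel can hold it instead – this assumes you have added a secret named ANTHROPIC_API_KEY there and granted the notebook access to it:

# Optional: read the key from Colab's Secrets sidebar instead of hardcoding it
from google.colab import userdata

os.environ["ANTHROPIC_API_KEY"] = userdata.get("ANTHROPIC_API_KEY")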

Then, load the PDF we downloaded that contains the details about QLoRA:

loader = PDFReader()
documents = loader.load_data(file=Path("/content/QLORA.pdf"))

The setup for the model itself is pretty simple in this case: since Anthropic models are not open source, we can only interact with them through their API:

llm = Anthropic(
    model="claude-3-sonnet-20240229",
)

tokenizer = Anthropic().tokenizer
Settings.tokenizer = tokenizer
Settings.llm = llm
Settings.chunk_size = 1024

First, we ask the model about QLoRA to see if it possesses any knowledge on this topic:

# resp contains the response
resp = llm.complete("What is QLORA?")

# Using HTML with inline CSS for styling (blue color, smaller font size)
html_text = f'<p style="color: #1f77b4; font-size: 14px;"><b>{resp}</b></p>'
display(HTML(html_text))

QLORA is not a commonly recognized acronym or term that I'm familiar with. Without more context, it's difficult for me to provide a definitive explanation of what QLORA means or refers to. Acronyms can have multiple meanings across different fields or contexts. Could you provide some additional details about where you encountered this term or what domain it relates to? That would help me try to determine the intended meaning of QLORA.

New context and questions

As you can see from the output above, the model doesn't have any data on this topic, which is great for us, as we can now attempt to extend its knowledge on the subject using RAG. We'll now set up ChromaDB as our vector database and load the data from the downloaded paper into it. Chroma is an open-source vector embedding database. When queried, it computes the feature vector of our prompt and retrieves the most relevant documents – the one we load into it – using similarity search, so they can then be passed to the language model as context. You don't need to run any external servers, as ChromaDB runs within our Jupyter notebook and was installed at the beginning with pip.

#Create client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("firstcollection")

# Load the embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
index = VectorStoreIndex.from_documents(
  documents, storage_context=storage_context, service_context=service_context
)

Chroma will not have a hard time figuring out which document to return, since we only use one, but you can definitely play around with loading additional documents and seeing how that affects the results – a rough sketch of that is shown below.
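As an illustration only, this is one way a second paper could be added to the existing index – the file path is hypothetical, and the sketch relies on VectorStoreIndex exposing an insert() method for adding documents one at a time:

# Hypothetical example: load a second PDF and add it to the same index
extra_documents = loader.load_data(file=Path("/content/ANOTHER_PAPER.pdf"))
for doc in extra_documents:
    index.insert(doc)

For now, though, we will just ask the same question again: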

#Define query
query="what is QLORA?"

query_engine =index.as_query_engine(response_mode="compact")
response = query_engine.query(query)

# Using HTML with inline CSS for styling (blue color)
html_text = f'<p style="color: #1f77b4; font-size: 14px;"><b>{response}</b></p>'
display(HTML(html_text))

But now we get a different answer:

QLORA stands for Quantized Low-Rank Adaptation. It is a technique for efficiently finetuning large language models by only updating a small set of parameters during training, rather than the full model weights. This allows for significant memory savings compared to standard full model finetuning. QLORA uses quantization to further reduce the memory footprint, with model weights stored in a low-precision 4-bit format during training. The key components are low-rank adaptation using LoRA layers, and quantization using a custom 4-bit numeric format called NormalFloat4. This enables finetuning of very large models like GPT-3 on a single GPU, which would not be feasible with full precision finetuning.

This seems like a pretty good quality answer, and in line with the paper we used as context. This example demonstrates how we can use RAG to feed new information to a model and keep it up to date without having to re-train it from scratch on an even larger dataset.

Now that we confirmed that the new context is indeed being used, we can also ask something else that should be included in the paper:

chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("What would be potential real world use cases for QLoRA?")
print(response)

And the response:

Querying with: Given the previous conversation, what would be potential real-world use cases or applications for QLoRA (Quantization-aware Learned Optimized Residual Addition), which is a technique for quantizing and compressing large language models like GPT while maintaining their performance?
Based on the information provided, some potential real-world applications and use cases for QLoRA could include:

1. Enabling efficient finetuning and deployment of large language models on resource-constrained devices like mobile phones, IoT devices, or edge computing systems with limited memory and compute capabilities.

2. Facilitating the use of state-of-the-art large language models in cloud services or web applications where memory and computational efficiency is crucial for scalability and cost-effectiveness.

3. Allowing researchers and developers to experiment with and finetune very large language models (e.g., 65B parameters) on modest hardware like a single GPU, accelerating research and development in natural language processing.

4. Reducing the carbon footprint and energy consumption associated with training and running large language models, making them more environmentally sustainable.

5. Enabling the deployment of high-performance language models in embedded systems, robotics, or other specialized hardware with strict memory and compute constraints.

The key advantage of QLoRA seems to be its ability to compress and quantize large language models to a 4-bit representation while preserving their performance through efficient finetuning techniques like Low Rank Adapters (LoRA). This could unlock a wide range of applications where state-of-the-art language models were previously impractical due to resource constraints.

While the model couldn't quite figure out what QLoRA stands for – the source paper does not spell out the acronym, so that's hardly a surprise – the response ended up being a pretty solid list of potential use cases.

Summary

From this small-scale RAG demo, we can see how easy it can be to enhance the memory of language models in a performant and relatively resource-efficient manner – the vector database still requires space to store the new information, but no GPU is needed for the retrieval process itself. In this example we used an externally hosted model, but you can also use other pre-trained models that run locally in Colab, provided you can get hold of one of those elusive GPU instances.
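As a rough sketch of that route – assuming you land a GPU instance, and with an example model name you may need to swap for one you can access and fit into memory – the HuggingFaceLLM wrapper we imported earlier could replace the Anthropic client while the rest of the pipeline stays the same:

# Hypothetical sketch: run an open-weight model locally instead of calling the Anthropic API
# The model name is only an example – pick one that fits the available GPU memory
from llama_index.llms.huggingface import HuggingFaceLLM

local_llm = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.2",
    context_window=4096,
    max_new_tokens=256,
    device_map="auto",
)
Settings.llm = local_llm  # the rest of the RAG pipeline is unchanged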
