Basic Retrieval Augmented Generation with Local LLMs

Experiences and tips for Retrieval Augmented Generation (RAG) with Ollama for grounding LLMs on up-to-date or private information.


This session was led by our colleague Kieran at the 3rd AI Insights Barcamp (N°4 will start in September 2024), and he is now sharing his experiences and tips here in the AI Product Circle.


Retrieval Augmented Generation is the process of telling an LLM "use this resource as the foundation of your answer". This helps prevent the inaccurate, outdated, or hallucinated responses that LLMs are prone to, without incurring the cost of fine-tuning or training the LLM directly on your resource. But doing this with a hosted LLM on the internet can mean exposing your data to the provider of that LLM. This can be unsuitable for private information or data that is internal to a company, so let's take a look at how we would do it offline using a local language model.
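
In essence, the retrieval step finds the passages most relevant to a question, and the generation step answers with those passages embedded in the prompt. As a rough sketch of the idea (this is not how llama-index implements it internally; the prompt wording here is purely illustrative):

# Conceptual sketch of RAG: the retrieved passages are stuffed into the prompt
# so the LLM answers from them instead of from its training data alone.
def build_rag_prompt(question: str, passages: list[str]) -> str:
    context = "\n\n".join(passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

The rest of this post is about doing the retrieval part properly, with embeddings and a vector index, while keeping everything on our own machine.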

Running an LLM locally

Ollama is a convenient tool for running LLMs on your own machine. It will use a supported GPU if you have one; otherwise it will run the LLM on your CPU. For our purposes, either is fine. Once we have Ollama installed, we can start an LLM as easily as running ollama run llama2:7b. This will download the llama2:7b model if it isn't available locally, and then start the model with a text chat interface.
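
Before wiring up any documents, we can also do a quick sanity check from Python using the same llama-index Ollama wrapper we will use below. This is just a minimal sketch and assumes the Ollama server is already running locally with the llama2:7b model pulled:

# Minimal check that the local model responds via the llama-index Ollama wrapper
from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama2:7b", temperature=0.1)
print(llm.complete("In one sentence, what is NixOS?"))

If this prints a sensible answer, the local setup is working and we can move on to feeding it our own documents.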

Providing a context to an LLM

Now that we have an LLM available locally, we will want to make it aware of our source(s) of truth. We will do that by writing some simple Python code using the llama-index package.
Llama-index makes RAG straightforward: it generates the embeddings we need for our LLM, and it can optionally store the resulting vectors to speed up iteration on our application.

Let's take a look at a minimal example where we read a document, in our case, the NixOS manual, and generate the embeddings for it.

# Import modules
from llama_index.llms.ollama import Ollama
from pathlib import Path
from llama_index.core import VectorStoreIndex
from llama_index.readers.file import HTMLTagReader
from llama_index.core import Settings
from llama_index.embeddings.ollama import OllamaEmbedding
import sentencepiece as spm

# use the tokenizer for llama2
Settings.tokenizer = spm.SentencePieceProcessor().Encode
Settings.llm = Ollama(model="llama2:7b", temperature=0.1)
Settings.embed_model = OllamaEmbedding(model_name="llama2:7b")

# Load HTML data
loader = HTMLTagReader()
manual = loader.load_data(Path("./docs/nixos.org/manual/nixos/stable/index.html"))

# Build the VectorStoreIndex from the documents (this generates the embeddings)
index = VectorStoreIndex.from_documents(manual, show_progress=True)

As we run this, our terminal will display a progress bar that shows the parsing of the document and the creation of the appropriate embeddings. But how do we use it? All that’s left is to use the index to create an interactive engine. We can add the following lines to our script:

query_engine = index.as_query_engine(similarity_top_k=20, streaming=True)

while True:
    response = query_engine.query(input(">>> "))
    response.print_response_stream()
    print("\n")

Here we create a query engine from the index of our documents; then, in a loop, we read user input, send it to the LLM, and print the streamed response.

Now, when we rerun our Python script and create the embeddings, we will be able to ask the LLM questions about the documents.

Storing the embeddings

It can get a bit frustrating to need to regenerate our embeddings and watch that progress bar every time we restart our app. The good news is, we can easily store the vectors in a vector database so that subsequent runs of our application can skip this step. Let’s try it out using qdrant.

We will replace the lines in our script where we load the document and call VectorStoreIndex.from_documents with the following:

# Create the Qdrant-backed VectorStoreIndex
client = None
vector_store = None

# Only read the documents and generate embeddings if nothing has been persisted yet
if not os.path.isdir("./qdrant_data/"):
    client = qdrant_client.QdrantClient(path="./qdrant_data")
    vector_store = QdrantVectorStore(client=client, collection_name="manual")
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    loader = HTMLTagReader()
    manual = loader.load_data(Path("./docs/nixos.org/manual/nixos/stable/index.html"))
    index = VectorStoreIndex.from_documents(
        manual, show_progress=True, storage_context=storage_context
    )

# Create the Qdrant client and vector store if we skipped the block above
if client is None:
    client = qdrant_client.QdrantClient(path="./qdrant_data")

if vector_store is None:
    vector_store = QdrantVectorStore(client=client, collection_name="manual")

# Load the index from the stored vectors
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

What's going on here? We create a qdrant client with a path argument, which runs an embedded qdrant instance in-process and persists it to that path, rather than connecting to a separate server (this is convenient because we don't need to worry about running a database instance while we are just developing the app). We check whether a directory named qdrant_data already exists in our working directory; if it does, we skip reading our source documents and generating the embeddings. If it doesn't, we read the documents and generate our embeddings as before, this time passing a StorageContext. The storage context records our vectors in qdrant, which persists its data in the qdrant_data directory.
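
If we later want to move from the embedded local mode to a standalone qdrant server (for example one started via Docker), the only part that changes is how the client is constructed; the vector store and index code stay the same. A sketch, assuming a qdrant instance is reachable on the default port:

import qdrant_client
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Connect to a standalone qdrant server instead of the embedded local mode
client = qdrant_client.QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="manual")
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)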

Putting it all together, our script looks like this:

# Import modules
from llama_index.llms.ollama import Ollama
from pathlib import Path
from llama_index.core import VectorStoreIndex
from llama_index.readers.file import HTMLTagReader
from llama_index.core import Settings
from llama_index.embeddings.ollama import OllamaEmbedding
import sentencepiece as spm

import os
import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import StorageContext

# use the tokenizer for llama2
Settings.tokenizer = spm.SentencePieceProcessor().Encode
Settings.llm = Ollama(model="llama2:7b", temperature=0.1)
Settings.embed_model = OllamaEmbedding(model_name="llama2:7b")

# Create the Qdrant-backed VectorStoreIndex
client = None
vector_store = None

# Only read the documents and generate embeddings if nothing has been persisted yet
if not os.path.isdir("./qdrant_data/"):
    client = qdrant_client.QdrantClient(path="./qdrant_data")
    vector_store = QdrantVectorStore(client=client, collection_name="manual")
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    loader = HTMLTagReader()
    manual = loader.load_data(Path("./docs/nixos.org/manual/nixos/stable/index.html"))
    index = VectorStoreIndex.from_documents(
        manual, show_progress=True, storage_context=storage_context
    )

# Create the Qdrant client and vector store if we skipped the block above
if client is None:
    client = qdrant_client.QdrantClient(path="./qdrant_data")

if vector_store is None:
    vector_store = QdrantVectorStore(client=client, collection_name="manual")

# Load the index from the stored vectors
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine(similarity_top_k=20, streaming=True)

while True:
    response = query_engine.query(input(">>> "))
    response.print_response_stream()
    print("\n")

Running the script, we now only generate the embeddings from our source document if the qdrant_data directory does not yet exist; on subsequent runs, the stored vectors are loaded from disk instead.
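
If we want to double-check what was persisted, we can query the local qdrant data directly. A small sketch (run it while the app itself is not holding the local store open; it assumes the collection name "manual" and the ./qdrant_data path used above):

import qdrant_client

# Open the persisted local store and inspect the collection created by our app
client = qdrant_client.QdrantClient(path="./qdrant_data")
print(client.get_collections())
print(client.count(collection_name="manual"))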

Evaluating the results

In the resulting query interface we can ask the LLM direct questions about the source.

>>> What does the NixOS manual say about installing packages for single users instead of globally?

According to the NixOS manual, when installing packages for a single user instead of globally, you can use the `nix-env` command followed by the name of the user. For example, to install the `hello` package for a user named `alice`, you can run:

    nix-env -i alice hello
	
This will install the `hello` package in the user's profile directory, rather than globally on the system.

Conclusion

We have now built a basic app that uses retrieval augmented generation to answer questions from a user. We persist the vectors to avoid regenerating them each time we run the app, and we can regenerate them (if we change or add documents to our sources) by simply removing the qdrant_data directory. The full runnable code for this example is available at this git repository.


Author from the Geek Space 9 team:

const (
    Name = "Kieran O'Sullivan"
    Nick = "kidsan"
)