AI explained: RAG — retrieval-augmented generation
Everybody has heard by now of OpenAI’s ChatGPT and its ability to answer all sorts of questions. But how can an enterprise use those capabilities with its own private data? This is where RAG, or retrieval-augmented generation, comes in.
Large language models like the one behind ChatGPT are trained on data. Unfortunately, the model cannot simply be trained on private data, because anything baked into a shared model could end up accessible to everybody who uses it. So how can an enterprise use the capabilities of ChatGPT without ever sharing private data with the outside world?
To understand how this solution works, let’s look a bit deeper at how it uses large language models. A RAG setup relies on two models: one that holds the general knowledge and generates answers (the chat model), and one that translates both queries and documents into a “language” the system can compare, namely numerical vectors called “embeddings”. By passing private data through the embedding model, you translate it into a format that can be matched against a user’s question. These embeddings are stored in a vector database, which allows the system to quickly retrieve the right pieces of data and hand them to the chat model as context.
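To make the embedding step concrete: once the rag-api-server described in the next section is running, you can call its OpenAI-compatible /v1/embeddings endpoint directly. The request below is purely illustrative; the embedding model name is the one used in the LlamaEdge examples and may be different in your setup.
curl -X POST http://your_computer_ip:8080/v1/embeddings -H 'Content-Type: application/json' -d '{"model": "all-MiniLM-L6-v2-ggml-model-f16", "input": ["Paris is the capital of France."]}'
The response is a long list of floating-point numbers: the vector representation of the sentence, which is what ends up in the vector database.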
One way of testing this solution in a prototype setup is to use the rag-api-server from LlamaEdge. You basically run the rag-api-server. (Note: I had to recompile WasmEdge with CUDA support to be able to use my 3 NVIDIA GPUs.)
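The start command looks roughly like the one below. This is an illustrative sketch following the pattern from the LlamaEdge documentation: the model files, prompt templates and exact flags depend on the models you download and on the version of rag-api-server you use.
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:Llama-2-7b-chat-hf-Q5_K_M.gguf \
  --nn-preload embedding:GGML:AUTO:all-MiniLM-L6-v2-ggml-model-f16.gguf \
  rag-api-server.wasm \
  --model-name Llama-2-7b-chat-hf-Q5_K_M,all-MiniLM-L6-v2-ggml-model-f16 \
  --ctx-size 4096,384 \
  --prompt-template llama-2-chat,embedding \
  --port 8080
The first --nn-preload loads the chat model and the second one the embedding model; the comma-separated values of --model-name, --ctx-size and --prompt-template follow the same chat-then-embedding order.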
As well as the vector database.
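rag-api-server stores the embeddings in Qdrant, which can be run locally with Docker. The invocation below is the standard one; the ports and the qdrant_storage directory used for persistence can be adapted to your environment.
docker run -d -p 6333:6333 -p 6334:6334 -v "$(pwd)/qdrant_storage:/qdrant/storage" qdrant/qdrant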
Afterwards, you send a document to the RAG API server:
curl -X POST http://your_computer_ip:8080/v1/create/rag -F "file=@paris.txt"
The above sends the paris.txt demo document, which contains a lot of facts about Paris. Computing the embeddings can take many minutes, but it only has to be done once. For faster execution, you can use GPUs.
Afterwards, the OpenAI-compatible chat API can be used to ask questions. Below is an example where “a helpful assistant” responds to the question “How often do people go to hospital in Paris?”.
curl -X POST http://192.168.88.150:8080/v1/chat/completions -H 'accept:application/json' -H 'Content-Type: application/json' -d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "How often do people go to hospital in Paris?"}], "model":"Llama-2-7b-chat-hf-Q5_K_M"}'
The answer that came back was the correct one, 5.8 million annual patient visits, because the text contains the following sentence: “The hospitals receive more than 5.8 million annual patient visits.”
{"id":"chatcmpl-3437b890-5459-43db-801a-fb06c841a11d","object":"chat.completion","created":1727793843,"model":"Llama-2-7b-chat-hf-Q5_K_M","choices":[{"index":0,"message":{"content":"According to the provided text, Parisian hospitals receive more than 5.8 million annual patient visits. Therefore, it can be estimated that people go to hospital in Paris around 16,000 times per day (5,800,000 / 365 days).","role":"assistant"},"finish_reason":"stop","logprobs":null}],"usage":{"prompt_tokens":479,"completion_tokens":67,"total_tokens":546}}
What is interesting, however, is that the language model decided to calculate the number of daily visits and came up with around 16,000 per day, something that was not mentioned in the text itself.
You can also change the way the answer is generated by adjusting the system prompt. For example, with the following request payload:
'{"messages":[{"role":"system", "content": "You are a rude assistant who tries to be funny."}, {"role":"user", "content": "What is the location of Paris, France along the Seine River?"}], "model":"Llama-2-7b-chat-hf-Q5_K_M"}'
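For completeness, this payload goes into exactly the same curl call as before; only the -d argument changes:
curl -X POST http://192.168.88.150:8080/v1/chat/completions -H 'accept:application/json' -H 'Content-Type: application/json' -d '{"messages":[{"role":"system", "content": "You are a rude assistant who tries to be funny."}, {"role":"user", "content": "What is the location of Paris, France along the Seine River?"}], "model":"Llama-2-7b-chat-hf-Q5_K_M"}'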
So the rude assistant who tries to be funny came back with this reply: “*sigh* Oh, great. Another stupid question from a moron who can’t be bothered to do their own research. Paris is located in northern central France, you […]. It’s on the Seine River, duh! The city is spread widely on both banks of the river, and the lowest point is 35 meters above sea level. Why are you even asking me this? Did your brain cells get lost during your last nap? Geez…” Definitely rude indeed 🙂
After loading some information about gluten allergies, I was able to ask specific questions about that topic as well. Given that until now we have only used an API, why not add a user interface? Below is a Flutter project I have been working on, which automatically generates a wizard from a JSON description.
I use the text input field to send a request to the local RAG server and then use Markdown formatting to show the results. Here is the result with the system prompt “You are a funny assistant who explains things very well to five year olds.”:
LlamaEdge runs WASM in a single-threaded setup. As such, if you want to offer a truly scalable solution, this will not work: with two concurrent requests, the second is blocked until the first completes. A solution for this is to add a connection pool, which allows multiple requests to be handled at the same time.
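The blocking behaviour is easy to demonstrate: fire two requests at the same instance in parallel and the second answer only comes back once the first request has finished. The questions below are just illustrative ones against the same Paris data.
# Two requests started in parallel against a single rag-api-server instance;
# with a single-threaded server the responses arrive one after the other.
curl -s -X POST http://192.168.88.150:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"messages":[{"role":"user", "content": "How large is Paris?"}], "model":"Llama-2-7b-chat-hf-Q5_K_M"}' &
curl -s -X POST http://192.168.88.150:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"messages":[{"role":"user", "content": "How many people live in Paris?"}], "model":"Llama-2-7b-chat-hf-Q5_K_M"}' &
wait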
What is still missing is the ability to make more complex requests. For this we use LangChain, which allows complex AI steps to be “stitched together” into a chain. The next AI explained post will be about LangChain.
See other posts in the AI explained series:
- LLMs for content classification
- Integrating external tools with LangChain
- Image to text and computer vision
- LLM performance
If your business needs help with AI, why don’t we connect?