RAG Explained


When using ChatGPT, you have most probably encountered responses like, “I’m sorry, but as of my last knowledge update in January 2022,” or even answers that simply aren’t true. This is where RAG comes in: it injects additional knowledge (or content) into your interactions with an LLM, helping it answer queries it couldn’t answer on its own.

We hear about LLMs, prompts, and RAG everywhere. By now, I think most of us know what an LLM and a prompt are. But did you know that right now, RAG is just as important as both of these and powers most chatbot applications you may use? I recently ran a poll in our Learn AI Together Discord community to find out if people had already studied, created, or used RAG applications, and most voted that they still wanted to understand what RAG is used for. RAG is as important as your coursebook is for success in a class, so understanding what it is matters a lot in AI.

An LLM, or a large language model, is just an AI model trained on language to talk with humans, like GPT-4 used in ChatGPT. A prompt is simply your interaction with it. It’s the question you ask it.

But if you are experiencing issues like hallucinations or biases when using such a language model, then RAG, or retrieval-augmented generation, comes in.

Let’s quickly clarify hallucinations first. A hallucination is when the model returns something that sounds plausible but isn’t true, simply because it doesn’t know the answer. In fact, a language model is constantly “hallucinating”: it only predicts words in a statistical way. It turns out that when trained on the entire internet, it has seen so many examples that it manages to accurately predict the next logical words to answer most questions. Still, it doesn’t really understand what it’s talking about; it just outputs one probable word at a time. What is incredible is that most of these statistical guesses are actually true and answer our questions. However, some of them are genuine hallucinations of fabricated facts or scenarios, and that can cause quite a few problems if they are not sufficiently controlled.

While there are several reasons why LLMs hallucinate, it is mostly because they lack relevant context: either they cannot find the relevant data or they don’t know which data to refer to for a particular question. On top of that, they were trained to always answer rather than to say, “I don’t know.” RAG solves this by automatically adding more knowledge (or content) into your interactions with an LLM.

Put simply: you have a dataset (which is required), and you use it to help the LLM answer the (unknown and upcoming) user queries. This is the simplest form and requires a few steps to make it work, but this is the gist of a RAG-based system:

user question -> automatic search in a database for relevant information -> give back the question + relevant info found from dataset to LLM -> answer user

As you can see, with RAG, we use context from the user question and our knowledge base to answer it. This helps ground our model in knowledge we control, making it safer and better aligned. The disadvantage is that it limits our answers to our knowledge base, which is finite and probably not as big as the internet. It’s just like an open-book exam you would have in school: you already have access to most answers and simply need to know where to find them in your knowledge base. If you find the answer in the manual, it’s quite hard to fail the question and write something wrong!
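To make that flow concrete, here is a minimal sketch in Python. The `search` and `ask_llm` callables are stand-ins for whatever retrieval function and LLM call you use; they are illustrative assumptions, not a specific library’s API (the technical section below shows one way to implement them):

```python
from typing import Callable

def answer_with_rag(
    question: str,
    search: Callable[[str, int], list[str]],  # your retrieval over the knowledge base
    ask_llm: Callable[[str], str],            # your LLM call (e.g., ChatGPT)
    top_k: int = 3,
) -> str:
    """Minimal RAG loop: retrieve relevant chunks, then answer from them only."""
    # 1. Automatic search in the database for relevant information
    relevant_chunks = search(question, top_k)
    context = "\n\n".join(relevant_chunks)

    # 2. Give back the question + relevant info found to the LLM
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Answer the user, grounded in our own knowledge base
    return ask_llm(prompt)
```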

Jerry Liu, CEO of LlamaIndex, gave a very interesting view on how to see RAG in my recent podcast episode with him:

If you think about it, RAG is basically prompt engineering, because you’re basically figuring out a way to put context into the prompt. It’s just a programmatic way of prompt engineering. It’s a way of prompting so that you actually get back some context [from a database of yours, and in an automated way].

He also said to subscribe to the channel to learn more about AI! Ok, maybe that’s just a hallucination, actually… but you should still do it, honestly.

In RAG, you first need data, or knowledge, which can be in the form of documentation, books, articles, etc. You then only allow the LLM to search it and respond if the answer to the question is inside this knowledge base. Anyway, if you have access to accurate information, as with your school manual, why would you try to come up with something different instead?

This is (currently) the best way to control your outputs and make your model safer and aligned. Basically, the best way to ensure you will give the right answer and get your best grade.

For example, we recently built an AI tutor to answer AI-related questions. We wanted reliable responses for our students, both in terms of accuracy (giving the right answer) and relevancy (up-to-date information). With RAG, you can simply update your database if things have changed. It’s no big deal if the whole PyTorch library had a big update yesterday: scrape it again and update your dataset. Voilà! You don’t have to retrain a whole model or wait for GPT-4 to finally update its knowledge cutoff date!

The overall process of the bot is quite straightforward: we validate the question, ensuring it is related to AI and that our chatbot should answer it; then we search our database to find good, relevant sources; and finally, we use ChatGPT to digest those sources and give a good answer to the student.
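Here is a rough sketch of how that first validation step could look, using the OpenAI Python client (v1+). The prompt wording and the gpt-3.5-turbo model are assumptions for illustration, not the tutor’s exact setup:

```python
# Hypothetical sketch of the question-validation step: ask a chat model whether
# the question is AI-related before running retrieval. Assumes the openai v1
# client and an OPENAI_API_KEY environment variable; the prompt and model
# choice are illustrative, not the AI tutor's actual configuration.
from openai import OpenAI

client = OpenAI()

def is_ai_related(question: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # any chat model works here
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Reply with exactly 'yes' or 'no': is the user's question "
                        "about AI, machine learning, or closely related topics?"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```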

If you need safe information from an AI chatbot like a medical assistant, a tutor, a lawyer, or an accountant, you will be using RAG for sure. Well, maybe not if you are listening in 2030, but as of now, RAG is by far the best and safest approach to using a chatbot where you need factual and accurate information.

If you are a more technical reader:

To build a RAG-based chatbot or application like our AI tutor, we start by ingesting all our data into memory. This is done by splitting the content into chunks of text (fixed or flexible parts, for example, 500-character pieces) and feeding them to an embedding model, like OpenAI’s text-embedding-ada-002. This produces embeddings, which are just vectors of numbers representing your text; they make your life easier by letting you compare pieces of text numerically. You then save those vectors in memory (or a vector store). For a new question from a user, you repeat this process: embed the question with the same model and compare it with all the embeddings in your memory. Here, you are basically looking for the chunks most likely to answer the question, just like you’d look through chapters during an open-book exam to find a title that seems relevant to the current exam question.
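Here is a minimal sketch of that ingestion step, assuming the openai Python package (v1 client) and the text-embedding-ada-002 model mentioned above; the naive 500-character chunking and the in-memory NumPy storage are simplifications for illustration:

```python
# Sketch of the ingestion step: split documents into fixed-size chunks and embed
# them with text-embedding-ada-002. Assumes the openai v1 client, numpy, and an
# OPENAI_API_KEY in the environment; a real system would likely use a vector store.
from openai import OpenAI
import numpy as np

client = OpenAI()

def chunk_text(text: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking: cut the text every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts: list[str]) -> np.ndarray:
    """Turn each piece of text into a vector of numbers (an embedding)."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in response.data])

# Build the in-memory "knowledge base": the chunks and their embeddings.
documents = ["<your documentation, books, articles, ...>"]
chunks = [c for doc in documents for c in chunk_text(doc)]
chunk_vectors = embed(chunks)  # one 1536-dimensional vector per chunk for ada-002
```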

Once the most similar embeddings are found, ChatGPT is asked to understand the user’s question and intent and to use only the retrieved sources of knowledge to answer it. This is how RAG reduces hallucination risks and keeps information up to date: you can update your knowledge base as much as you want, and ChatGPT, or your current language model, simply picks information from it to answer.
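Continuing the sketch above (reusing `client`, `embed`, `chunks`, and `chunk_vectors`), the retrieval and answering step could look like this; cosine similarity and the prompt wording are again illustrative choices, not the exact setup of our tutor:

```python
# Sketch of retrieval + answering, continuing from the ingestion sketch above.
import numpy as np

def retrieve(question: str, top_k: int = 3) -> list[str]:
    """Embed the question and return the most similar chunks from memory."""
    q = embed([question])[0]
    similarities = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    best = np.argsort(similarities)[::-1][:top_k]
    return [chunks[i] for i in best]

def answer(question: str) -> str:
    """Ask the chat model to answer using only the retrieved sources."""
    context = "\n\n".join(retrieve(question))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the provided sources. "
                        "If the answer is not in them, say you don't know."},
            {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```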

Plus, as you can see, it cites all the sources it found for the question so you can dive in and learn more, which is another advantage when you are trying to understand a new topic!

Then, there are still many things to consider, like determining whether to answer a question at all (is it relevant, and is it covered in your documentation?), handling new terms or acronyms that are not in ChatGPT’s knowledge base, finding the relevant information more efficiently and accurately, etc. Those concerns are all things we’ve improved through various techniques like better chunking methods, rerankers, query expansion, agents, and more, which you can learn about in the free advanced RAG course we built with Towards AI and Activeloop, linked below.
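As one small example of those refinements, here is a rough sketch of query expansion: asking the model to rephrase the question a few ways and retrieving with each variant, which often surfaces chunks the original wording would have missed. It reuses the earlier sketches, and the prompt and merging strategy are assumptions for illustration:

```python
# Rough sketch of query expansion, reusing `client` and `retrieve` from above.
def expand_query(question: str, n: int = 3) -> list[str]:
    """Generate a few rephrasings of the question with the same meaning."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Rewrite the following question {n} different ways, "
                       f"one per line, keeping the same meaning:\n{question}",
        }],
    )
    variants = [line.strip() for line in
                response.choices[0].message.content.splitlines() if line.strip()]
    return [question] + variants[:n]

def retrieve_expanded(question: str, top_k: int = 3) -> list[str]:
    """Retrieve with every variant and merge the results, removing duplicates."""
    merged: list[str] = []
    for variant in expand_query(question):
        for chunk in retrieve(variant, top_k):
            if chunk not in merged:
                merged.append(chunk)
    return merged
```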

Before some of you ask: yes, an alternative to RAG is to fine-tune your model on your specific task, that is, to further train a model on your own data so it becomes more specialized and ingests the knowledge you have rather than always searching through it. It’s like memorizing the book before the exam instead of bringing it along. I have a video comparing fine-tuning and RAG to help you decide when to consider each, but in short, RAG stays relevant with or without fine-tuning: it is much cheaper to build and better at reducing (undesired) hallucinations, since you force the model to give answers based on documentation YOU control, not on things it ingested and will hopefully regurgitate correctly, as with fine-tuned models. Coming back to open-book exams, it’s just like professors making you focus on understanding the core material and logic rather than the knowledge itself, since you can always find it in manuals or on Google. The same goes for LLMs complemented with RAG. Plus, even though those models have much better memories than us, they are not perfect and will not retain all the information you give them. Thus, even with a model fine-tuned on hyper-specific data, RAG remains something worth leveraging.

Before we end this video, I just wanted to mention that we discuss both of these topics in depth, with coding examples, in our LLM and RAG courses if you want to put this knowledge into practice! The link is in the description below.

I hope you’ve enjoyed this video and that it helped you understand the goals and principles of RAG better. If you did, please share it with a friend or your network to spread the knowledge and help the channel grow!

Thank you for reading!

Louis


References

► Jump on our free RAG course from the Gen AI 360 Foundational Model Certification (built in collaboration with Activeloop, Towards AI, and the Intel Disruptor Initiative): https://learn.activeloop.ai/courses/rag

► Twitter: https://twitter.com/Whats_AI

► My Newsletter (my AI updates and news, clearly explained): https://louisbouchard.substack.com/

► Support me on Patreon: https://www.patreon.com/whatsai

► Join Our AI Discord: https://discord.gg/learnaitogether

► How to start in AI/ML — A Complete Guide: https://www.louisbouchard.ai/learnai/