Long Context Models Explained: Do We Still Need RAG?

Many people say that RAG is dead now that we see all the new models coming out with large context windows, like GPT-4o mini, which can process up to 128,000 input tokens, or, even more impressively, Gemini 1.5 Pro, which can process 2 million tokens. For context, 2 million tokens is equivalent to roughly 3,000 pages.

So, do we still need retrieval-augmented generation, knowing that better models with ever-larger context windows and capabilities will keep emerging in the short and long term?

I don’t think the answer is a simple yes or no. To understand why, it’s important to understand the benefits and trade-offs of using large context models compared to building a RAG pipeline, so you know when and why to spend time and resources developing one.

For example, let’s say you’re creating a personal AI writing assistant that needs access to your entire collection of written articles or books. What would you do?

Or you’re developing an AI financial assistant that needs to analyze financial reports. Which method will interpret the reports correctly, including the figures?

Knowing the pros and cons of using Long Context Models versus building a performant RAG system will save you time and money.

Since context windows are getting bigger and bigger and models are becoming multimodal, you may not want to waste time developing a retrieval pipeline for your specific use case, especially for a one-off task, and that makes perfect sense. But knowing how to best leverage these approaches will come in handy sooner or later, when the right application comes along.

First, let’s answer the question of what long context models are and what their benefits and drawbacks are.

Long-context language models are AI models capable of processing and reasoning over significantly larger amounts of input text than earlier LLMs. These models can handle contexts of hundreds of thousands to millions of tokens in a single prompt, allowing them to ingest and analyze entire documents, books, databases, or collections of information at once. Many proclaimed the death of RAG when Gemini shipped its million-plus-token context window.

GPT-4’s version from March 14th, 2023, could only process up to 8k tokens. Now, in July 2024, a year later, GPT-4o mini, a smarter and cheaper replacement for GPT-3.5 Turbo, can process up to 128k tokens. The recent Llama 3.1 suite of models also has 128k-token context windows. There are also recent models like Gemini 1.5 Pro, which can process up to 2M tokens. At roughly 0.75 words per token and about 500 words per page, that is ~3,000 pages of text!

This expanded context window enables LLMs to potentially perform tasks that traditionally require external tools or specialized systems, such as information retrieval, multi-document reasoning, and complex query answering, all within a single model.

This is particularly valuable, for example, when feeding an entire code base into context, where the model’s understanding benefits from seeing the full repository and how its parts are connected.

Working with a long context is also a good fit when extended processing time is not an issue. These models handle a large number of tokens iteratively: smaller segments are processed sequentially until the whole input is covered, and the knowledge from each sub-part is saved in an encoded, compressed form. I covered this in my infinite-attention article if you are interested in learning more about how they achieve that.
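To make that idea more concrete, here is a minimal, purely illustrative sketch of this sequential, memory-carrying style of processing. Everything in it is an assumption for illustration: `fake_embed` stands in for a real encoder, and the moving-average update stands in for the learned compressive-memory mechanisms that infini-attention-style models actually use.

```python
import hashlib
import numpy as np

DIM = 64  # size of the toy memory/embedding vectors

def fake_embed(text: str) -> np.ndarray:
    """Stand-in for a real encoder: a deterministic pseudo-embedding of a string."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(DIM)

def process_long_input(text: str, chunk_words: int = 512) -> np.ndarray:
    """Walk through a long input chunk by chunk, carrying a compressed memory
    forward instead of attending over the entire input at once."""
    words = text.split()
    memory = np.zeros(DIM)  # compressed state carried across chunks
    for start in range(0, len(words), chunk_words):
        chunk = " ".join(words[start:start + chunk_words])
        # A real model uses learned memory-update rules; this exponential
        # moving average is only a placeholder for that idea.
        memory = 0.9 * memory + 0.1 * fake_embed(chunk)
    return memory

state = process_long_input("a very long document " * 5000)
print(state.shape)  # (64,)
```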

So then, what are the benefits of RAG?

RAG is an excellent technique for handling more extensive collections of documents that cannot fit within a single LLM context window.

Contrary to some popular beliefs, well-made RAG systems are fast and accurate. Queries against a database of many documents are processed quickly thanks to efficient document indexing methods. When dealing with lots of data, this search step is much lighter than sending all the information directly to an LLM and asking it to find the needle in the haystack. With RAG, we can selectively include only the relevant information in the initial prompt, reducing noise and the potential for hallucinations. As a bonus, RAG allows for advanced techniques and systems, such as metadata filtering, graph-based retrieval, and hybrid search, to enhance performance rather than depending solely on the LLM.
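As a rough sketch of that flow, here is a minimal RAG loop. The word-overlap `score` function is only a stand-in for a real embedding model and vector index, and the resulting prompt would then be sent to whichever LLM API you use; none of this reflects any specific library.

```python
def score(query: str, doc: str) -> float:
    """Crude relevance score: fraction of query words that appear in the document.
    In practice you would use embeddings, a vector index, or hybrid search."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k most relevant documents for the query."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Send only the retrieved context to the model, not the whole corpus."""
    context = "\n\n".join(retrieve(query, docs))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

docs = [
    "Q2 revenue grew 12% year over year, driven by subscriptions.",
    "The office relocation to Austin is planned for next spring.",
    "Operating margin for Q2 was 18%, up from 15% in Q1.",
]
print(build_prompt("What was Q2 revenue growth?", docs))
# The prompt then goes to the model with far fewer tokens than stuffing
# every document into the context window.
```

In a production pipeline, only the retriever (embeddings, metadata filters, hybrid search) and the model call change; the overall shape of the loop stays the same.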

So which one is better?

Generally, I think there is a place for using, and even combining, both methods. Long context models simplify the overall process by reducing the need for complex RAG techniques, as they can handle larger chunks of information at once. This can improve the chances of including relevant information and reduce the need for extensive evaluation.

However, RAG remains valuable, especially when dealing with large datasets, when processing time is critical, or when cost-effectiveness is a priority. RAG is particularly useful when using LLMs through APIs, as it’s more efficient and cost-effective to retrieve and send only the most relevant information rather than processing vast amounts of text. 

Long context models may be preferable for one-off tasks, smaller datasets (e.g., analyzing one or two PDFs), or low query volumes, as they can be more cost-effective once you factor in the cost of building a performant RAG pipeline.
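To make the cost argument tangible, here is a back-of-the-envelope comparison of input-token spend. The price and query volumes are purely hypothetical placeholders (check your provider’s current rates), and the calculation ignores output tokens as well as the cost of building and operating the retrieval stack.

```python
# Hypothetical price, for illustration only; real rates vary by provider and model.
PRICE_PER_1M_INPUT_TOKENS = 0.15  # assumed $ per 1M input tokens

def monthly_input_cost(tokens_per_query: int, queries_per_month: int) -> float:
    """Rough input-token cost for a month of traffic."""
    return tokens_per_query * queries_per_month * PRICE_PER_1M_INPUT_TOKENS / 1_000_000

# Long-context approach: stuff ~100k tokens of documents into every prompt.
print(monthly_input_cost(100_000, 10_000))  # $150.00 of input tokens per month

# RAG approach: retrieve and send only ~2k relevant tokens per prompt.
print(monthly_input_cost(2_000, 10_000))    # $3.00 of input tokens per month
```

At low volumes the gap is negligible and the engineering cost of RAG dominates; at high volumes the retrieval step pays for itself quickly.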

The key differences lie in how information is added to the initial prompt. RAG adds only relevant information, potentially limiting hallucinations and noise, while long context models include all available information, placing more responsibility on the LLM to process it effectively. In practice, RAG is well-suited for applications like customer support systems and real-time data integration, while long context models excel in tasks involving complex multi-document analysis and summarization. 

Ultimately, the choice between these approaches depends on the application’s specific needs and constraints. So, to answer the original question, RAG isn’t dead. Both methods have strengths in different scenarios, so it’s ultimately up to you. Here’s a table to help you decide how to proceed with your application.

Now that you have a better idea of the benefits and trade-offs of each method, make sure to check out our course, where we discuss this topic in depth. You will learn about numerous advanced techniques, like context caching (built on KV caching), which Google recently released for its long-context models and which makes the process a bit more efficient.

Thank you for reading all the way through. If you are looking to learn more about RAG systems and LLMs, enroll in our new “Beginner to Advanced LLM Developer” course now!