“RAG is dead” is something you hear more and more often. And in a way, it is true — classic RAG as many people implemented it in 2023 does not scale, does not explain itself, and often breaks in subtle ways.
But RAG itself is not dead. It has evolved, and combinations such as GraphRAG have emerged. My related blog post How to Build a Knowledge Graph RAG Agent Locally with Neo4j, LangGraph, and watsonx.ai covers that direction.
This post is a technical experience report from real work with Retrieval-Augmented Generation systems — built locally, connected to IBM Cloud, using IBM Granite models and Milvus as a vector database. It is not a marketing article, and it is not a hype post. Everything shown here is based on things that actually worked, things that failed, and things that turned out to be more important than expected.
We start from zero and build up step by step: from basic concepts, to ingestion pipelines, to retrieval strategies, to agent-based and multi-tool RAG patterns, all the way to a reproducible, testable end-to-end setup.
Along the way, we explicitly talk about:
- where RAG systems usually break,
- why metadata is more important than most people think,
- how token limits silently destroy retrieval quality,
- and why “just adding an agent” does not magically fix bad retrieval.
The title “RAG Is Dead … Long Live RAG” reflects this journey. RAG 1.0 — simple retrieve-and-summarize pipelines — is no longer sufficient. What replaces it is a more deliberate, structured, and observable approach: optimized queries, multiple retrieval modes, tools, agents, and strong emphasis on reproducibility and debugging.
If you are looking for a copy-paste tutorial, this post is not for you. If you want to understand why modern RAG systems behave the way they do, how to design them cleanly, and how to avoid common traps, then this guide is written exactly for that purpose.
TABLE OF CONTENTS
PART I — Foundations
- What This Post Is About
- Why “RAG Is Dead … Long Live RAG”
- Core Concepts (LLMs, Embeddings, Vector DBs, RAG 1.0 → RAG 2.0)
- IBM Granite Models Overview
- Local vs. Cloud Execution Model (Our Architecture)
PART II — Starting From Zero
- Setting Up Your Environment (Local + IBM Cloud)
- Preparing Your Data
- Preprocessing & Chunking
- Metadata Strategy (Critical!)
PART III — Building Retrieval
- Milvus Collections, Indexes & Search Types Explained
- Ensuring Embedding Dimensions Always Match
- Query Optimization (Pre-Analysis Prompting)
- Token Length Effects & How To Fix Them
PART IV — The 7 RAG Invocation Patterns (Options)
- Pure Vector Search
- Retrieve + Summarize (One Prompt)
- Agent-Based RAG (Interactive)
- Optimized Query + One-Prompt RAG
- Optimized Query + Agent RAG
- Multi-Tool RAG (Vector Search + Summaries + Optimizers)
- Full Hybrid Mode (Everything Combined)
PART V — Putting It All Together
- Building a Reproducible RAG Test Framework
- Storing Metadata for Full Traceability
- Replay Mode (Deterministic RAG)
- Logging, Evaluation & Debugging
PART VI — Full Local Python + IBM Cloud Implementation
- Local Environment (Python)
- Using IBM Granite Embedding Models (Cloud inferencing)
- Milvus Cloud Setup (IBM Cloud Deployment)
- Local End-to-End RAG Pipeline (All 7 Modes Implemented)
- GitHub Ready Project Structure
- References, Transparency Notes & Legal Considerations
PART I — FOUNDATIONS
1.1. What This Post Is About
This post is your Zero-to-Hero guide for building modern Retrieval-Augmented Generation (RAG) systems using:
- IBM Granite models on IBM Cloud
- Milvus (hosted in watsonx.data on IBM Cloud)
- Local Python runtime (your laptop / local server)
This post provides you with a guide to:
- IBM Granite Models
- Milvus Vector Database on IBM Cloud
- Local development
- Transparent, reproducible pipelines
- Test frameworks
- Agent-based multi-step RAG
and contains:
- fundamentals
- architecture
- code
- debugging
- reproducibility
- test frameworks
- advanced RAG patterns
Everything in one place. No searching for internet fragments. Hopefully, “no missing pieces”.
1.2. Why “RAG Is Dead … Long Live RAG”
Retrieval-Augmented Generation is not dead.
1.2.1. Challenge still exists
Large Language Models have three hard limits:
- They forget (without a framework)
- They hallucinate
- They do not know your private data
No matter how big the model is:
- It does not know your PDF
- It does not know your database
- It does not know your tickets, logs, or emails
RAG solves exactly this:
“Bring your knowledge to the model at runtime.”
1.2.2. Confusion
Every few months, a new headline appears:
- “RAG is dead”
- “Agents replace RAG”
- “Fine-tuning beats retrieval”
- “Just use context windows”
And every time, the same thing happens:
people confuse tools with architecture.
RAG is not a product. RAG is not a framework. RAG is not a model.
1.2.3. A Pattern
RAG is a pattern. A pattern that simply means:
“Before the LLM answers, it retrieves knowledge from somewhere else.”
That is all.
But the old way of doing RAG is definitely dead. Smart RAG replaced it.
Traditional RAG (“RAG 1.0”) looked like this:
User Question → Embedding → Vector Search → LLM Summarizes → Answer
Modern RAG is more like this:
Pre-Analysis → Query Optimization → Retrieval → Reranking → Reasoning → Tools → Final Answer
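To make the difference concrete, here is a minimal sketch of the modern flow as plain Python. All helper functions are illustrative stand-ins, not a specific framework API; their bodies are trivial stubs so the structure stays runnable.
def pre_analyze(question: str) -> dict:
    # Stub: classify intent, detect language, extract entities, ...
    return {"intent": "factual", "language": "en"}

def optimize_query(question: str, analysis: dict) -> str:
    # Stub: a real system rewrites the question via an LLM (see Part III)
    return question.strip()

def retrieve(query: str) -> list[str]:
    # Stub: a real system runs a vector / hybrid search against Milvus
    return ["chunk about RAG basics"]

def rerank(query: str, chunks: list[str]) -> list[str]:
    # Stub: a real system rescores with a reranker model
    return chunks

def reason_with_tools(question: str, chunks: list[str]) -> str:
    # Stub: a real system lets an LLM (plus tools) compose the final answer
    return f"Answer to '{question}' based on {len(chunks)} chunk(s)"

def modern_rag(question: str) -> str:
    analysis = pre_analyze(question)
    optimized = optimize_query(question, analysis)
    ranked = rerank(optimized, retrieve(optimized))
    return reason_with_tools(question, ranked)

print(modern_rag("What is RAG?"))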
1.2.4. Traditional RAG Overview
The following images display the basic architecture. We start with traditional RAG. Figure 1 shows the complete end-to-end RAG process, including offline indexing and query-time retrieval. This separation is the foundation of everything discussed in the rest of this post.

Figure 1
Figure 2 shows the RAG ingestion pipeline and retrieval.

Figure 2
Figure 3 shows an example of multiple LLM interactions.
FYI: A RAPTOR retriever adds hierarchical, LLM-generated summaries to your vector index so that a RAG system can retrieve information at the right level of abstraction, not just the closest chunk.

Figure 3
1.3. Core Concepts and Terminology
LLM → Large language model (IBM Granite in our case). The LLM is the reasoning engine. It does NOT:
- search
- retrieve
- remember
It only:
- reads what you give it
- produces an answer
Embedding → Turns text into a numerical vector. Embeddings are numbers that represent meaning.
Example:
“What is RAG?”
→ [0.0123, -0.9912, 0.4421, …]
Key rules:
- Same meaning → similar numbers
- Different meaning → far-away numbers
- The vector size is fixed
- The text length does NOT change that size
- This is the bridge between human language and machine search
Vector DB → Stores embeddings and performs similarity search. A vector database stores:
- embeddings
- metadata
- chunk text
It supports similarity search, filtering, ranking, and hybrid retrieval. Examples:
- Milvus
- Elastic (vector mode)
- Weaviate
- PGVector
Chunk → A small piece of a document.
Metadata → Information about each chunk (VERY IMPORTANT).
RAG → Retrieve → Augment → Generate
Note: All of these concepts appear together in the full RAG flow shown in Figure 1.
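To make the “same meaning → similar numbers” rule tangible, here is a tiny, self-contained example with made-up 3-dimensional vectors. Real Granite embeddings have a fixed, much larger dimension such as 1024.
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up toy vectors: two similar questions and one unrelated sentence
v_what_is_rag  = [0.91, 0.10, 0.05]
v_explain_rag  = [0.88, 0.14, 0.02]
v_pizza_recipe = [0.03, 0.85, 0.51]

print(cosine_similarity(v_what_is_rag, v_explain_rag))   # high: similar meaning
print(cosine_similarity(v_what_is_rag, v_pizza_recipe))  # low: different meaning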
1.4. IBM Granite Models Overview
IBM Granite (2024+) includes:
Embedding models:
- Granite Embedding 30B
- Granite Code Embeddings
- Granite Multilingual Embeddings
LLMs:
- Granite 13B
- Granite 34B
- Granite 70B (Cloud)
1.5. Architecture: IBM Cloud + watsonx + Local
- Extraction, Preprocessing and ingestion

Figure 4
- Retrieval with a local tool implemented as a REST API or MCP server.

Figure 5
- Retrieval with an agent and a tool implemented as a REST API or MCP server. The Tools Server running in IBM Cloud Code Engine can also run locally as a container or application (watsonx Orchestrate, watsonx.ai, watsonx.data, Code Engine).

Figure 6
Choices for the implementation:
- Local code execution
- watsonx.ai IBM Granite models
- Milvus Vector DB in watsonx.data on IBM Cloud
1. Local will handle:
- preprocessing
- chunking
- ingestion
- query optimization
- agent orchestration
- testing
- reproducibility
2. Cloud will handle:
- embedding generation
- LLM generation
- vector search
- data storage
PART II — STARTING FROM ZERO
All following code snippets are illustrative and focus on structure and integration, not production-ready execution.
2.1. Setting Up Your Environment
This is how we can organize the folder structure for an example project (local Python + IBM Cloud).
Folders:
rag-structure/
├── src/
│ ├── embeddings.py
│ ├── vectorstore.py
│ ├── rag_modes.py
│ ├── agent.py
│ ├── utils.py
│ └── replay.py
├── data/
├── configs/
├── tests/
└── README.md
2.2. Preparing Your Data
You should never ingest raw documents.
Data preparation includes:
- Extract text
- Remove tables, signatures, junk
- Normalize whitespace
- Fix unicode issues
- Detect language
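A minimal cleanup sketch for plain text, assuming the text has already been extracted from the source document. Language detection is omitted here because it depends on the library you choose.
import re
import unicodedata

def clean_text(raw: str) -> str:
    # Normalize unicode (smart quotes, non-breaking spaces, combined characters)
    text = unicodedata.normalize("NFKC", raw)
    # Collapse runs of spaces/tabs and excessive blank lines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

print(clean_text("What  is\u00a0RAG?\n\n\n\nIt retrieves   knowledge."))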
2.3. Preprocessing & Chunking
Chunking rules:
- ~300–600 tokens per chunk usually works best
- Overlap of 10–20%
- Chunk boundaries must respect semantic units
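A simple word-based chunker sketch that approximates these rules. It uses whitespace tokens as a rough proxy for model tokens, so the numbers are indicative only.
def chunk_text(text: str, max_tokens: int = 400, overlap: int = 60) -> list[str]:
    # Whitespace split as a rough token proxy; real pipelines use the
    # embedding model's tokenizer for exact counts.
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + max_tokens, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # 10–20% overlap keeps context across boundaries
    return chunks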
2.4. Metadata Strategy
Metadata is the SINGLE most important part of modern RAG. Every chunk should include:
- source_name
- source_type
- timestamp
- author
- heading
- paragraph index
- tags
- keywords
- embedding model version
- chunking version
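Since pydantic is already part of the local environment, the chunk metadata can be modeled as a small schema. The field names below mirror the list above and are suggestions, not a fixed standard.
from datetime import datetime
from pydantic import BaseModel

class ChunkMetadata(BaseModel):
    source_name: str
    source_type: str               # e.g. "pdf", "wiki", "ticket"
    timestamp: datetime
    author: str | None = None
    heading: str | None = None
    paragraph_index: int
    tags: list[str] = []
    keywords: list[str] = []
    embedding_model_version: str   # e.g. "granite-embedding@2024-06"
    chunking_version: str          # e.g. "chunker-v2, 400 tokens / 15% overlap"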
2.5. Ensure you have your Ground Truth ready
In almost every RAG project we worked on, the hardest part was not the model, not the vector database, and not the framework. The real challenge was always the same: reliable ground truth. You can skip this chapter if you are in a hurry, because it is a big one.
2.5.1 What is our ground truth?
If you take only one thing from this chapter, it should be this:
A RAG system is only as good as its golden data: the golden questions, the golden answers, and the documents that justify them.
This problem did not disappear with better models. In fact, it became more visible.

Figure 7
My blog post Land of Confusion using Classifications, and Metrics for a nonspecific Ground Truth is related to this topic.
2.5.2 What We Mean by Ground Truth in RAG
In classical machine learning, ground truth usually means labeled data. In RAG, ground truth is more subtle and consists of three tightly connected parts:
- Golden Questions
- Golden Answers
- Golden Data (Source Documents)
They must be consistent with each other. If one part is weak, the entire system becomes unreliable.
2.5.3 Golden Questions
Golden questions define what the system must be able to answer.
Good golden questions are:
- Realistic (asked by real users or domain experts)
- Unambiguous
- Grounded in the available data
- Stable over time (or at least versioned)
Bad golden questions are:
- Vague or underspecified
- Hypothetical questions without supporting documents
- Questions that require reasoning beyond the data you actually store
In practice, we learned that collecting golden questions is a business activity, not a technical one. Engineers can help structure them, but domain experts must define them.
2.5.4 Golden Answers
Golden answers are not just “the correct text”.
A good golden answer:
- Is factually correct
- Is aligned with the business domain language
- Can be justified by specific documents
- Is scoped (not overly verbose, not overly short)
This is where many RAG evaluations fail. If your golden answer is vague or poorly written, you cannot reliably judge the system output later.
We strongly recommend storing:
- The golden answer text
- Metadata (version, author, date)
- References to the supporting documents
2.5.5 Golden Data (Source Documents)
Golden data answers a simple but uncomfortable question: which documents are allowed to be used to answer this question?
For each golden question, you should ideally know:
- Which documents must be retrieved
- Which documents must not be retrieved
- Which document passages are actually relevant
This is critical because many RAG systems fail silently: They produce a good-looking answer based on the wrong sources.
If you do not check document correctness, you are testing text generation, not retrieval.
2.5.6 Why This Is Still the Hardest Problem
Even with better embeddings, larger context windows, and agents, the problem remains:
- Data changes
- Policies change
- Documentation quality varies
- Business language evolves
There is no “one-size-fits-all” solution here. Ground truth must be treated as a living asset, not as a static test dataset.
2.5.7 Synthetic Data Generation (With Care)
To scale evaluation, we often add synthetic data generation.
Typical use cases:
- Generate paraphrases of golden questions
- Generate variations with different wording
- Generate negative examples (questions that should not match)
This helps with coverage, but it comes with risks.
Synthetic data is useful for:
- Stress testing retrieval
- Improving robustness
- Detecting embedding weaknesses
Synthetic data is not a replacement for real golden questions.
Our rule of thumb:
Synthetic data may extend ground truth, but it must never define it.
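One way to extend (never define) ground truth is a simple paraphrase-and-negatives prompt. The template below is only an illustration of the idea, not a prescribed format.
PARAPHRASE_PROMPT = """You are generating evaluation data for a RAG system.
Rewrite the following golden question in {n} different ways.
Keep the meaning identical, change only the wording.
Also produce {m} questions that look similar but must NOT be answerable
from the same documents (negative examples). Label each line clearly.

Golden question: {question}
"""

def build_paraphrase_prompt(question: str, n: int = 3, m: int = 2) -> str:
    # Fill the template; the resulting prompt is sent to your generation model
    return PARAPHRASE_PROMPT.format(question=question, n=n, m=m)

print(build_paraphrase_prompt("What is the refund policy for enterprise customers?"))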
2.5.8 Model as a Judge (Instead of Confusion Matrices and Other Classic AI Quality Metrics)
In older systems, we relied heavily on confusion matrices: precision, recall, F1, accuracy.
In RAG, this often breaks down.
Why?
- Answers are not binary
- Partial correctness matters
- Language quality matters
- Source correctness matters
Today, a more practical approach is model-as-a-judge.
Typical evaluation dimensions include:
- Is the answer correct?
- Is the answer grounded in the retrieved documents?
- Are the correct documents used?
- Is the answer complete and not hallucinated?
A strong LLM, prompted carefully, can evaluate these dimensions far better than traditional metrics.
Important: The judge model must be separate from the generation model to avoid bias.
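A minimal model-as-a-judge sketch along these dimensions. The prompt wording and the JSON shape are suggestions, and the reply is expected from a judge model that is separate from the generation model and instructed to return JSON only.
import json

JUDGE_PROMPT = """You are an impartial judge for a RAG system.
Given a question, a generated answer, and the retrieved documents,
rate each dimension from 1 (bad) to 5 (excellent) and return JSON only:
{{"correctness": _, "groundedness": _, "source_correctness": _,
  "completeness": _, "explanation": "_"}}

Question: {question}
Answer: {answer}
Retrieved documents: {documents}
"""

def build_judge_prompt(question: str, answer: str, documents: list[str]) -> str:
    # Join the retrieved chunks into one block for the judge
    return JUDGE_PROMPT.format(question=question, answer=answer,
                               documents="\n---\n".join(documents))

def parse_judge_reply(reply: str) -> dict:
    # The judge model is instructed to return JSON only
    return json.loads(reply)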
2.5.9 Validation, Not Illusion of Precision
Model-as-a-judge does not give you a single “accuracy number”. And that is a good thing.
Instead, you get:
- Structured feedback
- Qualitative explanations
- Actionable signals for improvement
This shifts evaluation from false precision to practical validation. My blog post Land of Confusion using Classifications, and Metrics for a nonspecific Ground Truth is related to this topic.
2.5.10 Key Takeaways
- Ground truth in RAG means golden questions, answers, and data — together
- Data quality is still the dominant success factor
- Synthetic data helps, but only as an extension
- Model-as-a-judge is more useful than confusion matrices
- Evaluation must focus on correctness and grounding
In the end, RAG is not primarily a modeling problem. It is a data and validation problem — and it will remain one for the foreseeable future.
PART III — BUILDING RETRIEVAL
3.1. Milvus Collections, Indexes & Search Types
Figure 8 shows the basic dependencies in Milvus; for more details, please visit the Milvus documentation.

Figure 8
- Field Schema
from pymilvus import FieldSchema, DataType

vector_embedding = FieldSchema(
    name="vector_embedding",
    dtype=DataType.FLOAT_VECTOR,
    dim=1024
)
- Index Types
- HNSW → fast, accurate
- IVF_FLAT → scalable
- FLAT → exact
- Search Metrics
- COSINE (recommended)
- IP
- L2
- Example Search Source Code
# Embed the query with the same model used at ingestion time,
# then run a cosine-similarity search against the collection.
search_vector_embedding = embed_text(search_text)
res = self.client.search(
    collection_name=collection_name,
    anns_field="vector_embedding",
    data=[search_vector_embedding],
    limit=self.search_limit,
    search_params={"metric_type": "COSINE"}
)
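Putting schema, index type, and metric together, a collection could be created roughly like this. This is a sketch using the pymilvus ORM-style API; the collection and field names are illustrative, and the connection details are assumed to come from environment variables.
import os
from pymilvus import (connections, FieldSchema, CollectionSchema,
                      DataType, Collection)

# Connect to the Milvus instance (e.g. Milvus in watsonx.data on IBM Cloud)
connections.connect(uri=os.getenv("MILVUS_URI"),
                    user=os.getenv("MILVUS_USER"),
                    password=os.getenv("MILVUS_APIKEY"))

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="chunk_text", dtype=DataType.VARCHAR, max_length=4096),
    FieldSchema(name="source_name", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="vector_embedding", dtype=DataType.FLOAT_VECTOR, dim=1024),
]
schema = CollectionSchema(fields, description="RAG chunks")
collection = Collection(name="rag_chunks", schema=schema)

# HNSW index with cosine similarity, as recommended above
collection.create_index(
    field_name="vector_embedding",
    index_params={"index_type": "HNSW",
                  "metric_type": "COSINE",
                  "params": {"M": 16, "efConstruction": 200}},
)
collection.load()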
3.2. Ensuring Embedding Dimensions Always Match
The embedding dimension is fixed per model.
Example: Granite Embedding Model → 1024 dimensions.
As long as documents and questions are embedded with the SAME MODEL, the dimensions always match.
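A cheap safety net is to assert the dimension once at startup, so a model or configuration change fails fast instead of silently corrupting the collection. Here embed_fn stands for the embedding helper shown earlier.
EXPECTED_DIM = 1024  # must match the dim of the FLOAT_VECTOR field in Milvus

def check_embedding_dim(embed_fn, expected_dim: int = EXPECTED_DIM) -> None:
    # Embed a tiny probe text and compare the vector length to the collection dim
    probe = embed_fn("dimension probe")
    if len(probe) != expected_dim:
        raise ValueError(
            f"Embedding dimension {len(probe)} does not match "
            f"collection dimension {expected_dim}; check the model id."
        )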
3.3. Query Optimization
Query optimization is critical for modern RAG. With today's huge context windows, the user question plus its surrounding context can easily exceed the token limit of the embedding model. That is why the next chapter matters: we need to take care of token length as well as embedding dimensions.
Prompt:
Rewrite the question to be short, unambiguous, and optimized for embedding-based vector retrieval.
Keep only the semantic core.
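A sketch of how this rewrite prompt can be wrapped. The generate callable is a placeholder for whatever LLM endpoint you use (for example the watsonx.ai generation call shown later).
REWRITE_PROMPT = """Rewrite the question to be short, unambiguous, and optimized
for embedding-based vector retrieval. Keep only the semantic core.

Question: {question}
Rewritten question:"""

def optimize_question(question: str, generate) -> str:
    # 'generate' is any callable that sends a prompt to your LLM endpoint
    # and returns plain text (placeholder; plug in your own client here)
    return generate(REWRITE_PROMPT.format(question=question)).strip()

# Toy usage with a dummy generator, just to show the call shape
print(optimize_question(
    "I read a long story about retrieval yesterday and now I wonder what RAG actually is?",
    lambda prompt: "What is RAG?"
))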
3.4. Token Length Effects
Long questions → semantic dilution. Fix:
- rewrite
- extract core meaning
- remove story details
- reduce noise
PART IV — 7 RAG PATTERNS
Basic overview.
- Basic RAG usage in combination with other data resources and an inference platform.

Figure 9
- Retrieval Agent

Figure 10
- Retrieval Agent Outline Options

Figure 11
- RAG configuration options:
- 1. Pure Vector Search: The simplest setup, focusing only on semantic similarity search over embeddings. It is fast and easy to operate, but relies heavily on high-quality chunking and embeddings, with no reasoning or synthesis step.
- 2. Classic RAG (Retrieve + Summarize Once): Documents are retrieved once and summarized in a single LLM prompt. This pattern is widely used, easy to understand, and sufficient for many use cases, but limited when questions require decomposition or deeper reasoning.
- 3. Agent RAG: An agent controls retrieval and reasoning across multiple steps. This enables iterative querying, decision-making, and adaptive behavior, but increases complexity, cost, and the need for strong observability and guardrails.
- 4. Optimized Query + Single Prompt: The user query is rewritten or enriched before retrieval, then answered in one prompt. This improves recall and relevance while keeping the architecture simple, predictable, and testable.
- 5. Optimized Query + Agent: Query optimization is combined with an agent-based execution flow. This approach balances better retrieval quality with flexible reasoning, but requires careful tuning to avoid unnecessary agent loops.
- 6. Multi-Tool RAG: The RAG system can invoke multiple tools such as vector search, SQL, APIs, or domain services. This is the most powerful pattern, but also the most demanding in terms of data quality, ground truth, and operational discipline.
- 7. Full Hybrid Mode (Everything Combined): You expose the following building blocks (see the sketch after this list):
- Query rewrite tool: Transforms the user input into a clearer, more retrievable query. This step improves recall and reduces ambiguity, especially for short or underspecified questions, for example by pre-analyzing the context of a question.
- Retrieval tool: Encapsulates access to vector stores, hybrid search, or structured backends. Its quality is primarily determined by data preparation, chunking strategy, and metadata design.
- Summarization tool: Condenses retrieved content into a focused context for the LLM. This helps control token usage and reduces noise, but can hide missing or weak source data.
- Agent with tool-selection logic: Decides which tools to call and in which order based on the task. This enables flexible reasoning and orchestration, but shifts complexity from prompts to system design and evaluation.
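Here is the promised sketch of the hybrid mode as a minimal tool registry. The tool functions are stubs so the structure stays visible, and a real agent would replace the fixed tool order with LLM-based tool selection.
def rewrite_tool(question: str) -> str:
    return question.strip()  # stub: real version calls the query optimizer

def retrieval_tool(query: str) -> list[str]:
    return [f"chunk matching '{query}'"]  # stub: real version calls Milvus

def summarize_tool(chunks: list[str]) -> str:
    return " | ".join(chunks)  # stub: real version calls the LLM

TOOLS = {"rewrite": rewrite_tool, "retrieve": retrieval_tool, "summarize": summarize_tool}

def hybrid_rag(question: str) -> str:
    # Fixed tool order for illustration; an agent would decide this per request
    query = TOOLS["rewrite"](question)
    chunks = TOOLS["retrieve"](query)
    return TOOLS["summarize"](chunks)

print(hybrid_rag("What is RAG?"))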
PART VI — FULL IMPLEMENTATION
4.1. ARCHITECTURE IMPLEMENTATION
4.1.1. Retrieval Augmented Generation

Figure 12
4.2. Local Python Environment
A minimal requirements.txt for the local environment:
pymilvus
ibm-watsonx-ai
pydantic
fastapi
requests
4.3. Example using IBM Granite Embeddings in watsonx.ai on IBM Cloud
import os
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import Embeddings
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames as EmbedParams

# Credentials and project id are assumed to come from environment variables
creds = Credentials(
    url=os.getenv("WATSONX_URL"),
    api_key=os.getenv("WATSONX_APIKEY"),
)

embed_params = {
    EmbedParams.TRUNCATE_INPUT_TOKENS: int(os.getenv("WATSONX_EMBEDDING_MODEL_TRUNCATE_INPUT_TOKENS")),
    EmbedParams.RETURN_OPTIONS: {
        'input_text': True
    }
}

granite_embed = Embeddings(
    model_id="ibm/granite-embedding-30b",
    credentials=creds,
    params=embed_params,
    project_id=os.getenv("WATSONX_PROJECT_ID")
)

def embed_text(text):
    # embed_query returns one embedding vector for the given text
    return granite_embed.embed_query(text)  # 1024 dims
4.4. Milvus watsonx.data (IBM Cloud)
Create a Milvus client:
import os
from pymilvus import MilvusClient

client = MilvusClient(
    uri=os.getenv('MILVUS_URI'),
    user=os.getenv('MILVUS_USER'),
    password=os.getenv('MILVUS_APIKEY'),
    db_name=os.getenv('MILVUS_DB_NAME')
)
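As a small follow-up, chunks plus metadata could be inserted with this client roughly as follows. The collection name "rag_chunks" and the field names match the illustrative schema from Part III, and embed_text is the embedding helper from section 4.3.
# Insert one illustrative chunk into the collection created earlier
text = "RAG retrieves knowledge before the LLM answers."
rows = [{
    "chunk_text": text,
    "source_name": "rag-intro.pdf",
    "vector_embedding": embed_text(text),
}]
client.insert(collection_name="rag_chunks", data=rows)
print(client.list_collections())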
4.5 Use a prompt template deployed on watsonx.ai
You can use a prompt template deployed on watsonx.ai to generate the summary. Also useful in this context is the blog post AI Prompt Engineering: Streamlining Automation for Large Language Models.
token = f"Bearer {self._get_bearer_token()}"
header = {"Content-Type": "application/json", "Authorization": token }
url = f"https://{self.watsonx_config['WATSONX_REGION']}.ml.cloud.ibm.com/ml/v1/deployments/{self.watsonx_config['WATSONX_DEPLOYMENT_ID']}/text/generation?version=2021-05-01"
timer0 = time.time()
try:
response_error_analysis = requests.post(
url,
data= json.dumps(payload_prompt_template),
verify= False,
headers= header,
)
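For completeness, the payload_prompt_template used above would look roughly like this for a deployed prompt template; the variable name "context" is hypothetical and must match the variables defined in your own template.
# Illustrative payload for a deployed prompt template with one variable
# named "context" (your template's variable names will differ)
payload_prompt_template = {
    "parameters": {
        "prompt_variables": {
            "context": "<retrieved chunks go here>"
        }
    }
}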
4.6 Use an agent in watsonx Orchestrate (locally or remote)
In this context it is useful to understand the watsonx Orchestrate REST API; an easy starting point is my blog post REST API Usage with the watsonx Orchestrate Developer Edition locally: An Example.
def chat_completion(self, question: str, rag_summary: str):
    # Send the question plus the RAG summary to an agent in watsonx Orchestrate
    agent_id = self.watsonx_config['WATSONX_ORCHESTRATE_AGENT_ID']
    base_url = self.watsonx_config['WATSONX_ORCHESTRATE_BASE_URL']
    endpoint = f"v1/orchestrate/{agent_id}/chat/completions"
    self.agent_url = f"{base_url}/{endpoint}"
    token = self._get_bearer_token()
    content = f"Question:\n{question}\nRAG summary:\n{rag_summary}\n"
    payload = {
        "messages": [
            {
                "role": "human",
                "content": content
            }
        ],
        "additional_parameters": {},
        "context": {},
        "stream": False
    }
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    }
    # Invoke the agent in watsonx Orchestrate
    try:
        response = requests.request("POST",
                                    self.agent_url,
                                    json=payload,
                                    headers=headers)
        if response.status_code == 200:
            # Return the parsed agent response on success
            return response.json()
        else:
            return response
    except Exception as e:
        response = f"Exception:\n{e}"
        return response
4.7. Full Local RAG Pipeline (All Options)
def rag_pipeline(question, option):
    # Optimize the question and embed it with the same model used at ingestion
    q_opt = optimize_question(question)
    q_emb = embed_text(q_opt)
    hits = milvus_search(q_emb)
    if option == 1:
        # Option 1: pure vector search, return the raw hits
        return hits
    if option == 2:
        # Option 2: retrieve + summarize in one LLM prompt
        return llm_summarize(question, hits)
    if option == 3:
        # Option 3: agent-based RAG over the retrieved hits
        return agent_process(question, hits)
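A hedged sketch of how the remaining options could be wired on top of the same helpers. The tools keyword on agent_process is an assumption for illustration, not an existing signature.
def rag_pipeline_extended(question, option):
    # Reuses optimize_question, embed_text, milvus_search, llm_summarize,
    # and agent_process from the snippets above
    q_opt = optimize_question(question)
    hits = milvus_search(embed_text(q_opt))
    if option == 4:
        # Option 4: optimized query + single summarization prompt
        return llm_summarize(q_opt, hits)
    if option == 5:
        # Option 5: optimized query + agent-controlled reasoning
        return agent_process(q_opt, hits)
    if option == 6:
        # Option 6: multi-tool RAG; 'tools' is an assumed extra parameter
        return agent_process(question, hits, tools=["vector", "sql", "api"])
    if option == 7:
        # Option 7: full hybrid; summarize first, then let the agent reason
        summary = llm_summarize(q_opt, hits)
        return agent_process(question, summary, tools=["vector", "summary"])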
4.8 Potential GitHub repository structure
rag-zero-to-hero/
├── src/
│ ├── embeddings.py
│ ├── vectorstore.py
│ ├── rag_options.py
│ ├── agent.py
│ ├── optimizers.py
│ ├── config.py
│ ├── replay.py
│ ├── evaluation.py
│ └── main.py
├── configs/
├── tests/
├── data/
├── docs/
├── README.md
└── LICENSE
5. Summary
With this post, I am not trying to define the perfect RAG architecture or claim that it will still look the same tomorrow. The reality is that everything around RAG changes extremely fast: APIs, frameworks, programming models, agents, tools, and even the way we think about large language models. What is relevant today may be obsolete in a few months.
The goal of this post is therefore different. I want to create something as close as possible to a “one-size-fits-most” foundation for Retrieval-Augmented Generation as it is relevant today. A single reference point that brings together the essential concepts, patterns, and architectural decisions that consistently matter across projects — regardless of which specific framework, SDK, or model happens to be fashionable at the moment.
This post is meant to ensure that the basics are truly understood: what RAG actually is (and is not), where it breaks, why naïve implementations fail, and which design choices have long-term impact. It deliberately focuses on fundamentals such as data preparation, chunking, metadata, embeddings, retrieval strategies, token limits, observability, and reproducibility — because these aspects remain stable even as tooling evolves.
The examples in this post use IBM technologies (IBM Granite models, watsonx.ai, watsonx.data, Milvus, and watsonx Orchestrate) for a simple reason: this is the ecosystem I know best and have used extensively in real projects. The architectural patterns and lessons, however, are not IBM-specific and can be transferred to other platforms with minimal adaptation.
Across multiple RAG projects, one challenge has consistently dominated all others: data quality and ground truth. Not models, not agents, not vector databases — but the availability of reliable source data, well-defined questions, and trustworthy answers. Even with emerging approaches such as synthetic data generation or “model-as-a-judge” evaluation, this problem does not disappear. It merely changes shape, and it will remain a central challenge for the foreseeable future.
This post is therefore not a promise of future-proof solutions. It is a consolidation of hard-earned lessons from real RAG systems, intended to give readers a stable mental model and a solid starting point — so that when the next framework, API, or agent abstraction appears, they can evaluate it with clarity instead of starting from zero again.
I hope this was useful to you. Let's see what's next!
Greetings,
Thomas
FYI: References, Transparency Notes & Legal Considerations for this post
This article is based on my own practical experience with Retrieval-Augmented Generation systems. I used AI tools selectively to help structure and refine the text. All technical content, architectural decisions, and conclusions reflect my own understanding and judgment.
#RAG, #RetrievalAugmentedGeneration, #RAGArchitecture, #LargeLanguageModels, #VectorSearch, #AgenticAI, #QueryOptimization, #GroundTruthData, #GoldenData, #SyntheticDataGeneration, #ModelAsAJudge, #AISystemDesign, #EnterpriseAI, #IBMwatsonx
