The Cup Is Not the Coffee: What Data Quality Means in the AI Era

Today, many AI systems are built from impressive technical components: MCP servers, vector databases, graph databases, agents, tools, and large context windows.

In practice, however, I increasingly notice that the biggest mismatch is often not in the orchestration. It is in the data. If the available data is incomplete, outdated, badly extracted, inconsistent, untrusted, or simply not aligned with the real user expectation, the AI system will fail — even when the architecture looks modern and technically strong.

That is why I like this simple metaphor, because in the end:

The cup is not the coffee.

But without a good cup you cannot drink the coffee —
and without good coffee, the cup has no value.

Reliable AI systems require both.
And in AI systems, the coffee is the data.

A beautiful cup does not fix bad coffee.
In the same way, a modern AI architecture does not fix weak data.

Table of contents

  1. Why I think this matters now
  2. RAG versus large context is not the real question
  3. Data quality in the AI era
    1. Relevant dimensions of data quality
      1. Relevance
      2. Correctness
      3. Completeness
      4. Consistency
      5. Timeliness
      6. Traceability
      7. Ownership
      8. Integrity
      9. Usability for AI
  4. What is golden data?
  5. Tools like Docling help a lot — but they do not solve the whole problem
  6. A simple example from beginning to end
    1. The available data
      1. Step 1 — Extraction
        1. Pros
        2. Cons
      2. Step 2 — Chunking and indexing
        1. Pros
        2. Cons
      3. Step 3 — Agent behavior
        1. Pros
        2. Cons
      4. Step 4 — The result
  7. What this example shows
    1. Relevance
    2. Golden data
    3. Mismatch between data and expectation
    4. Ownership
    5. Compromised or degraded data
    6. Who collects the data
    7. AI context
  8. Why the same data performs differently across AI architectures
    1. In a long context prompt
    2. In RAG
    3. In an agent workflow
  9. Summary
  10. References

1. Why I think this matters now

In recent years, I have seen more and more AI systems built from strong technical components.

The components are often not the problem.

The bigger problem is usually somewhere else:

  • The available data does not match the real task. For example, the system has general policy information, but the user asks about a region-specific case.
  • The important exception is missing. For example, the main rule is indexed, but the promotional exclusion is not.
  • The extracted data lost structure. For example, a table with conditions, notes, and exceptions is reduced to flat text.
  • The source is not authoritative. For example, the system retrieves a copied wiki page instead of the approved source of truth.
  • The content is outdated. For example, an older process description is still available and sounds valid, although the process has already changed.
  • Nobody really knows who owns the data. For example, several teams contribute information, but no team is clearly responsible for maintaining it.
  • The user expects certainty, but the system only has partial information. For example, the answer sounds reliable, but the underlying knowledge base is incomplete.

This becomes even more important when AI is used in areas where the user cannot easily validate the result.

That is a key point for me.

If a user asks something simple, they may notice when the answer is wrong. But if the AI supports analysis, decisions, recommendations, support tasks, or process guidance, the user often cannot fully verify the outcome.

In these situations, data quality is not just a technical detail.
It becomes a trust topic.

2. RAG versus large context is not the real question

Today, there is a lot of discussion about RAG versus large context windows.

That is a useful technical discussion. But I think it can miss the deeper issue. RAG still depends on good data. Large context still depends on good data. Agents still depend on good data. Graph-based retrieval still depends on good data. Tools exposed through MCP still depend on good data.

It does not matter whether the data is retrieved, injected into context, traversed as a graph, or accessed through a tool. If the underlying data is weak, incomplete, outdated, inconsistent, or not fit for the intended task, the outcome will still be weak.

The technical mechanism changes. The dependency on data quality does not.

A simple example makes this clear.

Imagine the authoritative refund policy is missing, and only an outdated version is available.

In a RAG setup, the system may retrieve the wrong chunk or miss the important exception.
In a large-context setup, the system may include the outdated policy in full and still treat it as valid.

So the failure looks different, but the root problem is the same:
the system does not have the right data in the right quality.

That is why I think the deeper question is usually not:

  • RAG or long context?
  • vector database or graph database?
  • tool call or prompt?
  • agent or no agent?

The deeper question is this:

Is the data fit for the task the AI system is supposed to solve?

More context does not automatically mean better context.

3. Data quality in the AI era

For me, data quality in AI means this:

Data quality means that the data is fit for the intended AI use case.

That sounds simple, but it includes many dimensions.

In classical IT, data quality is often discussed in terms such as correctness, completeness, and consistency. These dimensions still matter in AI systems. But in the AI era, they are no longer enough on their own.

We also need to ask whether the data is relevant for the actual task, whether it includes the important exceptions, whether it comes from a trustworthy source, whether it is still current, and whether it can be used reliably inside an AI pipeline.

In other words, data quality in AI is not only about whether data exists.

It is about whether the data is good enough for the decision, answer, or action the system is expected to support.

In this metaphor, the cup represents the delivery structure around the answer:

  • the agent
  • the MCP server
  • the vector database
  • the graph database
  • the prompt
  • the orchestration

But the value is still in the coffee.
And in AI systems, the coffee is the data.

3.1 Relevant dimensions of data quality

Here are some of the dimensions that matter most in practice.

3.1.1 Relevance

Does the data actually help answer the user’s question?
A system can have a lot of data and still fail, because the data is not relevant to the actual task.

3.1.2 Correctness

Is the content factually correct?
Wrong facts, wrong versions, wrong mappings, or wrong extracted values directly damage the result.

3.1.3 Completeness

Does the data include the important cases and exceptions?
This is often where practical systems fail.
The main rule is present, but the exception is missing.

3.1.4 Consistency

Do all sources say the same thing?
If one source says one thing and another says something else, the AI system may choose one without understanding the conflict.

3.1.5 Timeliness

Is the data still current?
Old policy documents, old product descriptions, or outdated process steps can produce very confident but wrong answers.

3.1.6 Traceability

Do we know where the data came from?
If we cannot trace the source, it becomes difficult to justify trust.

3.1.7 Ownership

Do we know who owns the data and who maintains it?
This is a very practical question. If nobody owns the data, quality usually degrades over time.

3.1.8 Integrity

Has the data been compromised, manipulated, or damaged?
This matters especially if data is copied across systems, transformed multiple times, or extracted from poor input formats.

3.1.9 Usability for AI

Can the AI pipeline actually use the data reliably?
A human-readable PDF is not automatically AI-ready data.
A table in Excel may look clear to a person, but extraction can flatten or distort its meaning.
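These dimensions become much more useful when they are treated as an explicit, testable checklist rather than an abstract ideal. As a rough illustration, a pipeline could attach a small quality record to every document before it is indexed. All the names and thresholds below are my own hypothetical examples, not a standard:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class QualityRecord:
    """Hypothetical per-document quality metadata, one field per dimension."""
    source_id: str
    is_authoritative: bool       # golden data: governs the answer for this task
    owner: Optional[str]         # team responsible for maintaining the content
    last_reviewed_days: int      # timeliness signal
    extraction_issues: list = field(default_factory=list)  # e.g. flattened tables

    def fit_for_indexing(self, max_age_days: int = 365) -> list:
        """Return the reasons this document should NOT be indexed as-is."""
        problems = []
        if not self.is_authoritative:
            problems.append("not the golden source for this use case")
        if self.owner is None:
            problems.append("no clear ownership")
        if self.last_reviewed_days > max_age_days:
            problems.append("possibly outdated")
        problems.extend(self.extraction_issues)
        return problems

old_policy = QualityRecord("refund-policy-2023", is_authoritative=False,
                           owner=None, last_reviewed_days=400)
print(old_policy.fit_for_indexing())
# → three problems: not golden, unowned, possibly outdated
```

The point is not this particular schema. The point is that every dimension above can be turned into a concrete gate that data must pass before it reaches the AI system.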

4. What is golden data?

A useful and practical term here is golden data.
Golden data is the most trusted and authoritative version of data for a specific purpose. It is the data that defines what the correct answer should be. It is not simply the newest document, but the source that should govern the decision for that use case.

In other words, golden data represents the source of truth for a given task.

For example:

  • the official refund policy, not an old slide deck
  • the approved product specification, not a copied wiki page
  • the maintained support procedure, not an email thread
  • the valid contract terms, not someone’s summary

Humans inside an organization often know which source is authoritative.
AI systems usually do not.
They only see the data that is made available through documents, retrieval pipelines,
context windows, APIs, or tools.

If multiple sources exist and the authoritative one is not clearly identified, the system may retrieve the wrong one.

That is why identifying and exposing golden data is a key design step when building
AI systems.

A simple situation illustrates the problem. Imagine an AI assistant retrieves five documents about refunds:

  1. last year’s policy
  2. this year’s updated policy
  3. a support wiki page summarizing the rule
  4. a training slide deck
  5. a legal document describing campaign exceptions

To a human, it may be obvious that this year’s updated policy, read together with the legal campaign exceptions, is the authoritative source.

To the AI system, all five documents may simply look like “relevant text”. If we do not clearly identify the golden source, the system may retrieve the wrong one. The answer may still sound fluent and confident — but it may be wrong. That is why identifying golden data is one of the most important design steps when building AI systems.
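One pragmatic way to fix this is to tag every source with governance metadata and let retrieval prefer the golden source, instead of treating all five documents as equally relevant text. A minimal sketch, where the `authority` tags and weights are my own hypothetical convention set by data governance, not something a retriever can infer on its own:

```python
# Each candidate carries an "authority" tag assigned by data governance.
# Similarity alone cannot see this distinction.
documents = [
    {"title": "Refund policy (last year)", "similarity": 0.82, "authority": "superseded"},
    {"title": "Refund policy (this year)", "similarity": 0.79, "authority": "golden"},
    {"title": "Support wiki summary",      "similarity": 0.85, "authority": "derived"},
    {"title": "Training slide deck",       "similarity": 0.70, "authority": "derived"},
    {"title": "Legal campaign exceptions", "similarity": 0.65, "authority": "golden"},
]

def rank(docs):
    """Prefer golden sources first; fall back to similarity within each tier."""
    weight = {"golden": 2, "derived": 1, "superseded": 0}
    return sorted(docs, key=lambda d: (weight[d["authority"]], d["similarity"]),
                  reverse=True)

for d in rank(documents)[:2]:
    print(d["title"])
# The two golden sources now outrank the fluent but derived wiki summary,
# even though the wiki page has the highest raw similarity score.
```

Note that the wiki summary scores highest on pure similarity. Without the authority tier, it would be retrieved first, and the answer would be built on a copy instead of the source of truth.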


5. Tools like Docling help a lot — but they do not solve the whole problem

Today, tools like Docling make it much easier to extract content from PDFs, Word documents, PowerPoint files, and Excel sheets and prepare it for AI pipelines.

This is a significant step forward.

It improves access to enterprise knowledge that was previously difficult to use in AI systems. However, extraction is not the same as data quality. Even when extraction works well, several important questions remain:

  • Was the document structure preserved?
  • Was a table interpreted correctly?
  • Did footnotes stay connected to the rules they modify?
  • Was OCR accurate?
  • Was the newest document version used?
  • Is the extracted document actually the authoritative source?

Consider a simple situation.

A refund policy exists as a structured PDF document.
The main rule is defined in a table, and an exception is written in a footnote. After extraction, the AI pipeline may store:

  • the main rule as one text chunk
  • the exception as a separate chunk
  • the metadata as another fragment

Technically, the extraction worked. But the relationship between the rule and the exception may no longer be clear. The system now has more accessible information, but not necessarily more reliable information. Tools like Docling therefore remove an important barrier: they make enterprise knowledge accessible to AI systems.

But they do not automatically guarantee:

  • correctness
  • completeness
  • authority
  • trustworthiness

Extraction improves access to data.
But access is not the same as quality. Even a strong AI architecture cannot compensate for weak, ambiguous, or non-authoritative data.
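The footnote failure described above is easy to reproduce. In the deliberately naive sketch below, splitting extracted text by paragraph separates the rule from the footnote that modifies it; one possible mitigation, shown purely as an illustration, is to record the footnote link as chunk metadata so the exception can travel with the rule at retrieval time:

```python
extracted = (
    "Refunds are granted within 14 days of purchase. [1]\n\n"
    "[1] Discounted campaign products are excluded from the standard refund policy."
)

# Naive paragraph chunking: rule and footnote land in different chunks,
# and nothing records that [1] modifies the rule above it.
naive_chunks = extracted.split("\n\n")

# A structure-aware pass keeps the footnote reference as explicit metadata,
# so a retriever that fetches the rule can also fetch its exceptions.
def chunk_with_links(text):
    chunks = []
    for part in text.split("\n\n"):
        refs = [tok.strip("[]") for tok in part.split() if tok.startswith("[")]
        chunks.append({"text": part,
                       "footnote_refs": refs,
                       "is_footnote": part.startswith("[")})
    return chunks

linked = chunk_with_links(extracted)
```

Whether this is done with metadata, with graph edges, or by keeping rule and exception in the same chunk is a design choice. What matters is that the relationship survives extraction at all.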

6. A simple example from beginning to end

Let me use a simple example.

Imagine a company wants to build an AI assistant for customer support.

A customer asks:

“Can I return this product and get a full refund?”

The company builds a modern AI architecture:

  • documents are extracted with Docling
  • chunks are stored in a vector database
  • a graph database stores product and policy relationships
  • an MCP server exposes support tools
  • an agent answers the user

On paper, this architecture looks strong.

6.1 The available data

The company has several documents:

  • a refund policy PDF from last year
  • an updated refund policy PDF from this year
  • an Excel file with regional exceptions
  • a support wiki page
  • a legal document for promotional products
  • a few email clarifications between teams

Immediately the first data-quality question appears:

Which source is the real source of truth?

This is not an architecture problem.

It is a data governance problem.

Step 1 — Extraction

The documents are processed and extracted.

Pros
  • content becomes accessible
  • data from PDFs and spreadsheets enters the pipeline
  • information can be indexed and searched
  • hidden enterprise knowledge becomes usable
Cons
  • table structures may be flattened
  • headers and exceptions may lose context
  • OCR may introduce errors
  • version information may not be captured clearly
  • legal notes may detach from the main rule

At this stage the system may become technically richer but semantically weaker.

Step 2 — Chunking and indexing

The extracted text is chunked and stored.

Pros
  • retrieval becomes faster
  • relevant passages can be found
  • embeddings connect similar meanings
  • AI can answer beyond keyword matching
Cons
  • the main rule may land in one chunk
  • the exception may land in another chunk
  • the system may retrieve only the main rule
  • outdated chunks may rank highly
  • metadata may not prioritize authoritative documents

This is a very common failure pattern.

The AI does not fail because it cannot generate language.

It fails because the structure of the data no longer matches the structure of the decision logic.
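This failure pattern can be shown with a toy retriever. I use simple word overlap as a stand-in for embedding similarity here — a simplification for illustration, not how production retrievers score — but the shape of the failure is the same: the main rule matches the question well, while the exception that actually decides the answer scores lower and is cut off by a tight top-k budget:

```python
chunks = [
    "Customers can return products within 14 days for a full refund.",
    "Exception: discounted products purchased during a campaign in Germany are excluded.",
    "Contact support by email or phone for general questions.",
]

question = "Can I return a discounted product bought in Germany for a full refund?"

def overlap_score(query, text):
    """Toy relevance: fraction of query words that also appear in the chunk."""
    q = set(query.lower().strip("?").split())
    t = set(text.lower().rstrip(".").split())
    return len(q & t) / len(q)

ranked = sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)
top_1 = ranked[:1]   # a tight context budget keeps only the best chunk
# The main rule wins the ranking; the exception that reverses the answer
# is never retrieved, so the system confidently answers from the rule alone.
```

The retrieval step worked exactly as designed. The problem is that the decision logic lives in two chunks, and only one of them made it into the context.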

Step 3 — Agent behavior

Now the agent receives the user question:

“I bought a discounted product in Germany last week. Can I still get a full refund?”

To answer correctly the system needs:

  • the main refund policy
  • the rule for discounted products
  • the Germany-specific exception
  • the time rule
  • the newest version of the policy
Pros
  • the agent can combine multiple sources
  • tools can gather additional details
  • graph relations can connect product, country, and policy
  • the system provides natural language explanations
Cons
  • if one key exception is missing, the answer becomes wrong
  • if the wrong source is treated as authoritative, the answer still sounds correct
  • users may trust the result even when the system lacks complete information
  • additional system components can hide data weaknesses behind technical complexity

Step 4 — The result

The AI answers:

“Yes, you can get a full refund within 14 days.”

The answer sounds clear and professional.

But the correct answer should have been:

“No. Discounted products in Germany under this campaign are excluded from the standard refund policy.”

So what failed?

Not necessarily:

  • the model
  • the vector database
  • the graph database
  • the MCP server
  • the agent concept
  • Docling itself

What failed was the fit between the data and the decision.

The system produced a fluent answer from incomplete knowledge.

7. What this example shows

This simple scenario highlights many of the most important data-quality challenges in AI systems.

The problem was not a missing architectural component.
The system already had extraction, indexing, retrieval, and an agent.

The real problem was the relationship between the data, the decision, and the user
expectation.

7.1 Relevance

The system needs policy data, not just general support content.

If the retrieved information does not match the user’s specific situation, the answer may still sound plausible but remain incorrect.

7.2 Golden data

The system must know which document is authoritative.

If several documents describe the same rule, but only one is the official policy, the AI system needs to treat that document as the primary source.

7.3 Mismatch between data and expectation

The user expects a correct decision.

But the system may only have partial supporting material.

If the data does not fully cover the decision logic, the system may produce a fluent answer based on incomplete information.

7.4 Ownership

Someone must own and maintain the policy data.

Without clear ownership, rules, exceptions, and updates often become fragmented across documents and teams.

Over time, the knowledge base becomes inconsistent.

7.5 Compromised or degraded data

Data can degrade even without malicious intent.

Extraction, copying, chunking, and transformation may weaken the relationship between rules and exceptions.

The meaning of the information may become less clear even though the text still exists.

7.6 Who collects the data

Enterprise knowledge rarely comes from a single source.

Legal teams, support teams, regional teams, and sales teams may all contribute pieces of information.

Each group may document rules differently and with different assumptions.

7.7 AI context

The same data behaves differently depending on how it is used:

  • RAG retrieval
  • long context prompts
  • graph retrieval
  • agent workflows
  • tool responses

Data quality therefore cannot be evaluated independently from the AI usage context.

8. Why the same data performs differently across AI architectures

Another important observation is that data quality cannot be evaluated independently from the AI architecture that uses the data.

The same dataset can perform well in one setup and poorly in another.

This means that data quality is not only about the data itself — it is also about how the AI system interacts with that data.

To illustrate this, consider the refund-policy example from earlier.

The same documents are used, but the system architecture changes.

8.1 In a long context prompt

In a large-context setup, the AI model may receive an entire policy document.

Pros

  • the main rule and its exceptions may remain close together
  • the model can see surrounding explanations and legal wording
  • the system may better understand relationships inside the document

Cons

  • outdated and current policies may appear together
  • irrelevant sections can dilute the answer
  • the model must decide which rule is valid without strong signals

8.2 In RAG

In a retrieval-based system, the model receives only selected chunks.

Pros

  • retrieval can focus on relevant passages
  • large knowledge bases remain scalable
  • irrelevant content can be filtered out

Cons

  • an exception may live in a different chunk and not be retrieved
  • ranking may prefer an outdated policy section
  • the retrieved context may miss the rule that modifies the answer

8.3 In an agent workflow

In an agent-based architecture, multiple tools and steps may be involved.

Pros

  • the system can query additional sources
  • tools can gather structured information
  • reasoning steps can combine different pieces of knowledge

Cons

  • users may trust the result more than they should
  • missing or weak data can propagate through several steps
  • the final answer may look more convincing because the process appears complex
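The architecture dependence can be made concrete with one more toy comparison: the same two policy documents, delivered either in full (long context) or through top-1 retrieval (RAG). Word overlap again stands in for a real retriever, purely as an illustration:

```python
import string

old_policy = "Refund policy (2023): customers get a full refund within 30 days."
new_policy = "Refund policy (2024): refunds are limited to 14 days for standard items."

question = "How many days do I have to get a full refund?"

def words(text):
    return {w.strip(string.punctuation) for w in text.lower().split()}

def overlap(query, text):
    """Toy relevance: count of shared words between query and document."""
    return len(words(query) & words(text))

# Long context: both versions are placed in the prompt. The conflict is at
# least visible, but the model must decide which policy is valid without
# any strong signal such as an authority tag.
long_context = "\n".join([old_policy, new_policy])

# RAG with top-1: only the best-scoring document survives. Here the outdated
# wording happens to echo the question more closely, so it wins the ranking
# and the conflict never even reaches the model.
retrieved = max([old_policy, new_policy], key=lambda d: overlap(question, d))
```

Same data, two different failure shapes: in the long-context setup the model sees a contradiction it cannot resolve, and in the RAG setup it sees no contradiction at all. In both cases the fix is not architectural — it is marking which policy is the golden source.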

9. Summary

I think this topic will become more important, not less.

Today we are building increasingly sophisticated AI systems with powerful technical components:

  • agents
  • MCP servers
  • vector databases
  • graph databases
  • large context windows
  • document extraction pipelines

These systems represent real technical progress.
But the deeper question remains the same:
What data is actually inside the cup?

If the data is weak, incomplete, outdated, unowned, badly extracted, inconsistent, or disconnected from the real task, the AI system will remain unreliable.

Modern architecture can improve access to data.
It can improve reasoning.
It can improve interaction.

But it cannot replace missing, incorrect, or poorly governed data.

That is why data quality is not a side topic in the AI era.

It is one of the core design questions.

Because in the end:

The cup is not the coffee.

But without a good cup you cannot drink the coffee —
and without good coffee, the cup has no value.

Reliable AI systems require both.

10. References

  1. Patrick Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv.
    https://arxiv.org/abs/2005.11401
  2. Yunfan Gao et al. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv.
    https://arxiv.org/abs/2312.10997
  3. Model Context Protocol. Introduction.
    https://modelcontextprotocol.io/docs/getting-started/intro
  4. Docling Project. Docling Documentation.
    https://docling-project.github.io/docling/
  5. UK Government. Meet the data quality dimensions.
    https://www.gov.uk/government/news/meet-the-data-quality-dimensions
  6. Office for National Statistics. Data Quality Management Policy.
    https://www.ons.gov.uk/aboutus/transparencyandgovernance/datastrategy/datapolicies/dataqualitymanagementpolicy

Note: This post reflects my own ideas and experience; AI was used only as a writing and thinking aid to help structure and clarify the arguments, not to define them.


#ai, #dataquality, #rag, #mcp, #modelcontextprotocol, #docling, #llm, #generativeai, #aiarchitecture, #retrievalaugmentedgeneration, #datastrategy, #datagovernance, #softwarearchitecture, #agenticsystems, #knowledgeaccess, #trustworthyai
