With this blog post, I want to share (and save) my current thoughts on the Rise of Agentic AI and Managing Expectations.
Agentic AI—systems that autonomously plan and execute multi-step tasks—has emerged as a major trend in 2024 and 2025. In contrast to traditional generative AI that simply “completes” prompts, agentic AI adds a layer of decision-making and action.
With this in mind, I want to examine IBM's resources on this topic and where LangGraph fits into this context, combined with what I have already blogged and what I have in mind for how to handle the topic going forward.
Introduction
An agentic AI system helps design customized workflows, using tools and adapting its behavior to achieve user goals (source ibm.com, unite.ai).
For example, an agentic reservation system might proactively collect date, time, and weather data, calling external services as needed before finalizing a booking (source dataplatform.cloud.ibm.com).
IBM characterizes agentic AI systems as combining the flexible reasoning of LLMs with the precision of traditional programming (source ibm.com). By autonomously decomposing complex tasks and interacting with the real world (web services, APIs, databases), agentic systems can handle workflows that static chatbots cannot (source ibm.com, unite.ai).
This new paradigm has quickly gained popularity in business. Major vendors now offer agent lab tools to build such systems: for example, IBM’s watsonx.ai Agent Lab (in beta) lets users assemble agent templates that plan and act.
Industry observers note that 2024 saw a “breakthrough” in agentic AI, with organizations eager to let AI “interpret high-level goals, devise strategies, and autonomously execute plans with minimal human oversight” (source unite.ai, ibm.com). These agents can combine multiple AI models and tools under a single orchestrator, enabling advanced applications like AI-driven support agents, automated analysis workflows, and intelligent assistants.
However, with the rise of agentic AI, businesses need to be prepared for the complexity it brings. The very capabilities that make agentic AI attractive—long-running state, tool use, multi-step reasoning—also introduce new points of failure. As I pointed out in an earlier post, integrating LLMs into production brings many more dimensions of complexity than traditional microservice apps. I visualized this as an “AI Operational Complexity Cube”, where testing must cover not just the usual functional cases but also dimensions like prompt variations, model uncertainty, user interactions, and ethical considerations. That is why it is important to manage business expectations: agentic systems can behave unpredictably and require careful oversight.
Put simply, you should not forget: at the moment, a multi-agent system is, in the end, software running on hardware: a tool built by humans (with help from AI) for humans, to get work done more easily 😉
Table of contents:
- 1. LangGraph: A Powerful Orchestration Framework
- 2. LangGraph in IBM watsonx.ai
- 3. The Probabilistic Nature of LLMs
- 4. Hallucinations and the Need for Oversight
- 5. Accountability and Transparency
- 6. The AI Operational Complexity Cube (Testing LLM Systems)
- 7. Summary
- 8. References
1. LangGraph: A Powerful Orchestration Framework
At the core of many agentic AI solutions is LangGraph, an open-source “low-level orchestration framework” for building stateful, long-running AI workflows (source langchain-ai.github.io). Unlike high-level libraries that abstract prompts, LangGraph exposes the building blocks of an agent: a graph of nodes and states that control how the system reasons and acts. It provides durable execution (agents can checkpoint and resume after failures), built-in memory, human-in-the-loop inspection, and detailed debugging tools via LangSmith (source langchain-ai.github.io). In short, LangGraph lets developers create robust agents that can pause, remember context across sessions, and recover from interruptions—capabilities beyond what most “stateless” prompt chains offer. For example, agents built with LangGraph can incorporate both short-term working memory and persistent long-term memory, enabling multi-turn reasoning that carries knowledge forward (source langchain-ai.github.io).
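To make the “graph of nodes and states” idea more concrete, here is a minimal sketch of a two-node stateful graph with an in-memory checkpointer. This is my own illustration (assuming a recent LangGraph release); the node names and state fields are invented for the example, not taken from the official docs:

```python
# Minimal stateful LangGraph sketch (illustrative only; "plan"/"act" and the
# state fields are invented names for this example).
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    task: str
    notes: list[str]

def plan(state: State) -> dict:
    # In a real agent this step would call an LLM to decide what to do next.
    return {"notes": state["notes"] + [f"planned: {state['task']}"]}

def act(state: State) -> dict:
    # ...and this step would call a tool or an external API.
    return {"notes": state["notes"] + ["acted"]}

builder = StateGraph(State)
builder.add_node("plan", plan)
builder.add_node("act", act)
builder.add_edge(START, "plan")
builder.add_edge("plan", "act")
builder.add_edge("act", END)

# The checkpointer gives the graph durable, resumable state per thread_id.
graph = builder.compile(checkpointer=MemorySaver())
result = graph.invoke(
    {"task": "book a table", "notes": []},
    config={"configurable": {"thread_id": "demo-1"}},
)
print(result["notes"])  # ['planned: book a table', 'acted']
```

Invoking the compiled graph again with the same thread_id continues from the stored state, which is exactly the durability property described above.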
Importantly, LangGraph is not limited to purely agentic applications. Because it handles any orchestrated workflow, it can be used for complex, multi-step tasks even outside of a classic agent loop. The documentation notes that LangGraph supports “any long-running, stateful workflow or agent,” meaning you could use it to manage processes like document pipelines, iterative analytics jobs, or orchestrated data integrations (source langchain-ai.github.io). In practice, this makes LangGraph a versatile tool: you can use it with a ReAct-style agent (Reasoning + Acting) or employ its graph abstractions to enforce structure in other LLM-driven applications. And since LangGraph integrates with the broader LangChain ecosystem, developers also benefit from LangChain’s model integrations and LangSmith’s monitoring. As the LangGraph docs emphasize, it offers “production-ready deployment” infrastructure for exactly these long-running AI tasks (source langchain-ai.github.io).
One of LangGraph’s standout features is the ability to include custom tools. You can attach Python functions or API calls as tools that the LLM can invoke at runtime. This is the mechanism by which agents can query live data (weather, databases, etc.) or execute computations. For example, in one of my blog posts, a LangGraph ReAct agent used a weather API and a knowledge-base tool, invoking them via the ReAct loop to get grounded information (source suedbroecker.net, langchain-ai.github.io). Thus, LangGraph bridges LLM “reasoning” with concrete external actions in a structured, traceable way, giving developers the power to customize their agents as they see fit.
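As a rough sketch of what attaching such a custom tool looks like in code (my own minimal example, not the exact implementation from my blog post: the weather lookup is a hardcoded stand-in rather than a real API, and the model choice is just a placeholder for any LangChain-compatible chat model):

```python
# Minimal LangGraph ReAct agent with one custom tool (illustrative sketch).
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI  # placeholder model provider
from langgraph.prebuilt import create_react_agent

@tool
def get_weather(city: str) -> str:
    """Return the current weather for a city."""
    # Stand-in for a real weather API call.
    return f"It is 22°C and sunny in {city}."

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # any chat model works here

agent = create_react_agent(llm, tools=[get_weather])
result = agent.invoke({"messages": [("user", "What's the weather in Hamburg?")]})
print(result["messages"][-1].content)
```

The ReAct loop decides at runtime whether to call `get_weather`, feeds the tool result back into the model, and only then produces the final answer.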
2. LangGraph in IBM watsonx.ai
In IBM’s watsonx.ai platform, you can select LangGraph to build your agent. In fact, the Agent Lab in watsonx.ai currently offers only LangGraph for building agents (source dataplatform.cloud.ibm.com).
When setting up an agent in IBM’s UI or API, users must choose LangGraph as the framework and the ReAct technique as the reasoning architecture (source dataplatform.cloud.ibm.com).
watsonx.ai allows you to build agents in other ways outside the lab (for example, locally in code), but when using the built-in Agent Lab, LangGraph with ReAct is the default and only option.
IBM’s documentation makes this explicit: “You can only use the LangGraph framework to build agents in the Agent Lab… You can build agents with the ReAct (Reason + Act) technique only.” (source dataplatform.cloud.ibm.com)
LangGraph’s tight integration with IBM’s watsonx.ai platform means that developers can create agents locally and deploy them seamlessly.
For instance, in an earlier blog post I demonstrated deploying a custom LangGraph ReAct agent (with an added weather tool) to watsonx.ai and even orchestrating it with IBM watsonx Orchestrate.
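A rough sketch of what that local wiring can look like, assuming the `langchain_ibm` integration package with its `ChatWatsonx` class (the model ID, region URL, and project ID below are placeholders, and the API key is expected in the environment, e.g. as `WATSONX_APIKEY`):

```python
# Sketch: a local LangGraph ReAct agent backed by a watsonx.ai model.
# Illustrative only; IDs and URLs are placeholders, credentials come from
# environment variables (e.g. WATSONX_APIKEY).
import os

from langchain_core.tools import tool
from langchain_ibm import ChatWatsonx
from langgraph.prebuilt import create_react_agent

@tool
def get_weather(city: str) -> str:
    """Return the current weather for a city (stubbed here)."""
    return f"It is 18°C and cloudy in {city}."

llm = ChatWatsonx(
    model_id="ibm/granite-3-8b-instruct",         # placeholder model ID
    url="https://us-south.ml.cloud.ibm.com",      # your watsonx.ai region URL
    project_id=os.environ["WATSONX_PROJECT_ID"],  # your project ID
)

agent = create_react_agent(llm, tools=[get_weather])
answer = agent.invoke({"messages": [("user", "Weather in Berlin?")]})
print(answer["messages"][-1].content)
```

Once such an agent works locally, it can be packaged and deployed to watsonx.ai as described in the referenced post.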
3. The Probabilistic Nature of LLMs
A key fact about modern LLMs (IBM Granite, ChatGPT, Claude, GPT-4, etc.) is that they are inherently probabilistic, not deterministic—they model language as a sequence of token probabilities, not as a fixed-function rule. In practice, this means the same prompt can yield different answers each time (even when you use greedy decoding for your interactions), depending on factors like sampling temperature or even hidden randomness (source duarteocarmo.com, arxiv.org). As one practitioner notes: “At the core, LLMs are stochastic, and not deterministic… given the same input, the output is rarely the same” (source duarteocarmo.com). Even internal computations introduce nondeterminism. An LLM might “know” multiple valid continuations and will often choose among them somewhat randomly.
This probabilistic nature has important implications: you cannot expect a generative model to always produce the same output or to always be 100% accurate. Two runs with identical prompts may differ in phrasing or even content. The arXiv literature frames LLMs as “inherently probabilistic context-aware mechanisms” (source arxiv.org). What this means for users and businesses is that the AI’s answers will have variance and occasional mistakes. You must plan for non-determinism by averaging results, adding voting or validation steps, or simply accepting some variability.
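To illustrate what sampling temperature does to those token probabilities, here is a small self-contained toy example (the “logits” are made-up numbers, not taken from any real model):

```python
# Toy illustration: how temperature reshapes a token probability distribution.
import math

logits = {"Paris": 4.0, "Lyon": 2.5, "Berlin": 1.0}  # made-up scores

def softmax_with_temperature(scores: dict, temperature: float) -> dict:
    scaled = {token: s / temperature for token, s in scores.items()}
    z = sum(math.exp(v) for v in scaled.values())
    return {token: math.exp(v) / z for token, v in scaled.items()}

for temp in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, {token: round(p, 3) for token, p in probs.items()})

# Low temperature -> near-greedy behavior (almost always the top token);
# high temperature -> flatter distribution, so samples vary more between runs.
```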
I created the image below to visualize that even in a small scenario we can have four prompts interacting with LLM models. Question: will we end up with an accuracy of 0.9 or 0.65? 😉 We need to know how to isolate tests for each model interaction, tool, and combination.
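My assumption about the intended punchline of the image: if each of the four prompt/model interactions succeeds with probability 0.9 and the steps are independent, the accuracy of the whole chain is roughly 0.9 × 0.9 × 0.9 × 0.9 = 0.9⁴ ≈ 0.66, not 0.9. That is exactly why each interaction, tool, and combination needs to be measured in isolation.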

Critically, this uncertainty means LLMs often output plausible-sounding text that is not guaranteed factual. Their “knowledge” is the statistical imprint of their training data, not a logical database. As a result, hallucinations are common. We discuss that next.
4. Hallucinations and the Need for Oversight
“AI hallucination” refers to the phenomenon where an LLM produces information that is nonsensical or outright false (Kim, H. (2024), “Investigating the Effects of Generative-AI Responses on User Experience after AI Hallucination”).
In IBM’s words, a hallucination is when an AI “perceives patterns or objects that are nonexistent,” yielding outputs “that are nonsensical or altogether inaccurate” (source ibm.com).
For instance, a chatbot might confidently state a false historical fact or present a made-up statistic. These errors arise because the model is not retrieving facts but generating text that looks like data it has seen. Hallucinations can be subtle (misstating a name or date) or blatant (inventing fictitious sources).
The consequences can be serious. IBM notes that hallucinations can spread misinformation and even cause harm—for example, a hallucinating medical assistant might misdiagnose a patient (source ibm.com).
Therefore, human oversight is essential. Each AI output in a business system should be reviewed, validated, or clearly labeled. Tools like Retrieval-Augmented Generation (RAG) or ReAct agents can mitigate hallucination by letting the model consult external databases or tools.
Indeed, the original ReAct paper showed that interleaving LLM reasoning with API calls (e.g. to Wikipedia) “overcomes issues of hallucination and error propagation” compared to plain chain-of-thought (source arxiv.org). In practice, this means designing agents to double-check their assumptions (asking a knowledge base, invoking a search API, etc.) before finalizing an answer.
Define your mitigation strategies, but be aware that hallucinations cannot be entirely eliminated. Recent research underscores this point: even with perfect knowledge available, LLMs can still “hallucinate with high certainty” (source arxiv.org).
In other words, an AI might confidently assert something false even when the correct information exists. This happens because the model’s confidence does not always correlate with truth. As Simhi et al. (2025) demonstrate, “models can hallucinate with high certainty even when they have the correct knowledge” (source arxiv.org). Such findings drive home that no generative system is infallible.
Because of these limitations, businesses using AI must adopt checks and fallbacks. Typical measures include:
- Human-in-the-loop review: Any high-stakes output (legal advice, financial analysis, etc.) should be reviewed by a qualified person.
- Post-editing workflows: Label AI-generated text clearly and allow humans to correct errors.
- Robust testing: Simulate real-world prompts to gauge AI performance under realistic conditions (see my “AI Operational Complexity Cube” testing checklist).
- Fallback procedures: If an agent fails to confidently answer, it should defer to humans or fail gracefully (a minimal sketch of this pattern follows below).
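As a minimal illustration of the last two points, here is a sketch of a confidence-gated fallback wrapper. It is my own toy example: `ask_agent`, the confidence score, and the threshold are hypothetical placeholders, not part of any specific framework:

```python
# Sketch: answer only when a simple confidence heuristic passes,
# otherwise defer to a human (illustrative placeholder logic).
from dataclasses import dataclass

@dataclass
class AgentResult:
    answer: str
    confidence: float  # hypothetical score in [0, 1]

def ask_agent(question: str) -> AgentResult:
    # Placeholder: replace with a real call to your agent.
    return AgentResult(answer="(stub) Our return policy is 30 days.", confidence=0.55)

def answer_with_fallback(question: str, threshold: float = 0.8) -> str:
    result = ask_agent(question)
    if result.confidence >= threshold:
        # Label the output so users know it is AI-generated.
        return f"{result.answer}\n(AI-generated answer)"
    # Fail gracefully: defer to a human instead of guessing.
    return "I'm not confident enough to answer this. A human colleague will follow up."

print(answer_with_fallback("Can I return my order after 6 weeks?"))
```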
5. Accountability and Transparency
Given the unpredictability of AI, clear accountability is crucial. Experts emphasize that companies deploying AI remain responsible for its outputs. As Salesforce notes, “Companies are already held accountable for what their AI does” (source salesforce.com).
Businesses must be prepared to explain and justify AI-driven decisions. If an AI makes a mistake—say, a chatbot mishandles a customer request—the company cannot simply blame the machine.
Instead, organizations should establish clear ownership (often creating roles like an AI Ethics Officer or similar) to monitor AI behavior and intervene when needed ( source salesforce.com).
Transparency goes hand-in-hand with responsibility. Consumers have a right to know when AI is involved. Studies show that disclosing AI use builds trust. For example, one industry guide reports that “transparency about AI usage can set more realistic expectations about the content and reduce misunderstanding concerning the accuracy of the content” (source kontent.ai). In practice, this means labeling AI-generated content or clearly informing users when an agent is acting. Such disclosure helps users interpret outputs appropriately (knowing it’s AI, not a human expert, may temper their trust) (source kontent.ai).
Moreover, transparency is increasingly a compliance issue. Emerging regulations (and even private-sector guidelines) are beginning to require that companies reveal the use of generative AI in products and communications. As one guide notes, explaining how content was created is “the foundation of explainability” (source kontent.ai). Therefore, businesses should adopt policies to govern AI use and enforce disclosure. This might include internal audits of AI workflows, training for staff on AI limitations, and clear user-facing notices (e.g. “This email draft was generated by AI under human supervision,” or “AI chat assistant”).
In sum, that means: be transparent, test thoroughly, keep humans in charge, and maintain clear legal and ethical accountability for the outputs. The sophistication of agentic systems makes this all the more important.
A good starting point to address these challenges is to check out IBM watsonx.governance.
6. The AI Operational Complexity Cube (Testing LLM Systems)
My AI Operational Complexity Cube concept captures the reasons why these precautions are necessary. It envisions LLM-enabled systems as adding new “axes” of complexity beyond traditional software. For example, one dimension is prompt/testing complexity: variations in how users ask questions can dramatically change AI behavior. Another axis is data complexity, since models may require constant retraining or updates. A third is user interaction complexity, as humans engage with an AI that has its own reasoning loop. I suggest visualizing these factors as a cube, where every face represents a testing or validation challenge (source suedbroecker.net).
This idea serves as a proof point, reminding us that integrating generative AI is not a simple plug-and-play improvement. Like any complex piece of enterprise software, an AI agent must be carefully monitored across multiple dimensions (functional correctness, prompt robustness, performance, fairness, etc.) (source suedbroecker.net). Only with such rigorous testing and governance can organizations safely leverage agentic AI in production.
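To make one face of the cube concrete, here is a minimal test sketch for the combination of prompt variation and non-determinism, written with pytest. It is my own illustration: `ask_agent` is a hypothetical stub for a call to the agent under test, and the acceptance check is deliberately weak:

```python
# Sketch: test prompt variants against a non-deterministic agent with pytest.
# `ask_agent` is a hypothetical stub; replace it with a call to your agent.
import pytest

PROMPT_VARIANTS = [
    "What is the weather in Berlin right now?",
    "Berlin weather now, please.",
    "Tell me the current temperature in Berlin.",
]

def ask_agent(prompt: str) -> str:
    # Placeholder: replace with a real call to your deployed agent.
    return "(stub) The current weather in Berlin is sunny."

@pytest.mark.parametrize("prompt", PROMPT_VARIANTS)
def test_prompt_variants_stay_on_topic(prompt):
    # Run each variant several times because outputs are non-deterministic.
    answers = [ask_agent(prompt) for _ in range(3)]
    # Weak but stable acceptance criterion: every answer mentions the topic.
    assert all("berlin" in answer.lower() for answer in answers)
```

In a real setup you would add further test dimensions from the cube (tools, models, fairness checks) and track pass rates over time rather than expecting 100%.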
7. Summary
Agentic AI and frameworks like LangGraph unlock powerful new capabilities—enabling AI systems that can plan, act, and learn in ways simple chatbots cannot. LangGraph’s robust framework makes it practical to build and deploy these stateful agents (officially supported in IBM’s watsonx.ai) (source langchain-ai.github.io, dataplatform.cloud.ibm.com).
At the same time, the probabilistic nature of LLMs means outputs will vary and sometimes be wrong (source arxiv.org).
Hallucinations are a real risk, so human oversight, transparency, and accountability are non-negotiable. As experts emphasize, businesses must clearly disclose AI use and remain responsible for the results (source kontent.ai, salesforce.com). By adopting strict testing (for example, along the dimensions of the complexity cube), building in checks, and setting realistic expectations, organizations can harness agentic AI’s potential without being blindsided by its limitations.
Ultimately, agentic AI is a promising evolution of generative AI—but it requires just as much caution and governance as any other major enterprise technology. Properly managed, LangGraph-powered agents can drive efficiency and innovation, but stakeholders must always remember that an LLM’s “brain” is a probabilistic one. Keeping humans “in the loop” and clear-eyed about the AI’s role will be key to success.
8. References:
- Südbröcker, Thomas (2025), “Develop and Deploy Custom AI Agents to watsonx.ai on IBM Cloud”, Thomas Südbröcker’s Blog (Feb 25, 2025): detailed walkthrough of building LangGraph ReAct agents suedbroecker.net.
- Südbröcker, Thomas (2025), “Supercharge Your Support: Example Build & Orchestrate AI Agents with watsonx.ai and watsonx Orchestrate”, Thomas Südbröcker’s Blog (May 15, 2025): example of multi-agent orchestration with LangGraph suedbroecker.net.
- Südbröcker, Thomas (2025), “Exploring the ‘AI Operational Complexity Cube’ idea for Testing Applications integrating LLMs”, Thomas Südbröcker’s Blog (Mar 24, 2025): introduces the complexity cube concept for LLM-based systems suedbroecker.net.
- LangGraph Official Documentation, LangChain AI (online guides and reference, e.g. “LangGraph” homepage and quickstart) langchain-ai.github.io.
- IBM watsonx.ai Documentation, Agent Lab (beta) and Automating tasks with AI agents: explains that LangGraph is the sole framework and ReAct the sole architecture for building AI agents in watsonx.ai dataplatform.cloud.ibm.com.
- IBM Think Article (2024), “Agentic AI: 4 reasons why it’s the next big thing in AI research”: overview of agentic AI concepts and benefits ibm.com.
- Yao et al. (2022), “ReAct: Synergizing Reasoning and Acting in Language Models” (arXiv:2210.03629): original ReAct paper demonstrating that interleaving chain-of-thought and action (tool use) makes LLMs more reliable, reducing hallucinations arxiv.org.
- Simhi et al. (2025), “Trust Me, I’m Wrong: High-Certainty Hallucinations in LLMs” (arXiv:2502.12964): shows that LLMs can output confidently incorrect (hallucinated) answers, highlighting their unreliability without checks arxiv.org.
- Carmo, Duarte O. (2024), “LLMs in production: lessons learned” (blog): notes that LLM outputs are stochastic, with “the output [being] rarely the same” for a given input duarteocarmo.com.
- IBM Think Piece (2023), “What are AI hallucinations?”: defines AI hallucination and warns of real-world risks (healthcare, misinformation) if unchecked ibm.com.
- Salesforce (2025), “In a World of AI Agents, Who’s Accountable for Mistakes?” (blog): emphasizes that businesses must be able to explain and justify AI decisions and remain legally responsible for AI-driven outcomes salesforce.com.
- Kontent.ai (2024), “Emerging best practices for disclosing AI-generated content” (blog): advises that disclosing AI usage builds trust; “transparency about AI usage can set more realistic expectations” and is foundational to user trust kontent.ai.
I hope this was useful to you, and let’s see what’s next!
Greetings,
Thomas
#AgenticAI, #LangGraph, #LLMs, #AIOrchestration, #AIAgents, #IBMwatsonxai, #ReAct, #LLMTesting, #AIOperationalComplexityCube, #AIGovernance, #AIOversight, #HumanInTheLoop, #LangChain, #LangSmith, #StatefulAI, #MultiAgentSystems, #GenerativeAIWorkflows, #LLMHallucinations, #ProbabilisticAI, #AITransparency, #AIAccountability, #AILimitations, #ResponsibleAI, #AIinProduction, #EnterpriseAI, #FutureOfAI, #AIExpectations
