How do you know if your AI Agent really does what you expect? That’s where the Evaluation Framework in the watsonx Orchestrate ADK comes in. Instead of abstract theory, this post walks you through a concrete, reproducible example:
👉 Galaxium Travels – a fictional booking system, running locally with Docker, where we test one Agent connected to one Python Tool.
By the end, you’ll see step by step how to:
Define simple Stories (what users want from the Agent)
Generate synthetic Test Cases from those stories
Run evaluations directly in the watsonx Orchestrate ADK
Inspect results to verify correctness, reliability, and transparency.
This practical workflow helps you gain trust in your AI Agents, understand their decisions, and catch issues early.
Note: A custom Docker Compose setup in watsonx Orchestrate is not officially supported: LINK
Related: a 35-minute YouTube video :-).
Table of contents
Motivation
Set up the example Galaxium Travels Infrastructure
Set up the watsonx Orchestrate ADK and watsonx Orchestrate Development Edition Server
Create and import a local Python tool
Add a connection
Import an Agent
Generate synthetic test cases
Run the evaluation
Final thoughts
References & further reading
Optional: Additional network monitoring on the local machine
1. Motivation
In this blog post, we’ll explore how to test an AI Agent using the Evaluation Framework from the watsonx Orchestrate Agent Development Kit (ADK). Instead of diving into abstract theory, we’ll use a hands-on, concrete example.
👉 Galaxium Travels – a fictional booking system that runs locally with Docker. We will connect one Agent with one Tool and define a few small test stories to run automated evaluations.
By the end, you will clearly understand:
How to define simple Stories (what a user wants from the agent)
How to generate synthetic Test Cases from these stories
How to run evaluations in the watsonx Orchestrate ADK
How to inspect results to check correctness and reliability
This makes the topic not only practical but also reproducible on your own machine. We are building a Test Case for listing all available flights in Galaxium Travels.
1.1 Why this matters
AI Agents are powerful but often unpredictable. Without proper testing, they can surprise you in unexpected ways.
As I highlighted in Exploring the “AI Operational Complexity Cube idea” for Testing Applications integrating LLMs, evaluation and validation are essential for:
Trust – Can I rely on my Agent to give correct results?
Reliability – Will it work in different situations, not just once?
Transparency – Can I understand why the Agent took certain actions?
The watsonx Orchestrate ADK provides a built-in Evaluation Framework that helps answer these questions. This post demonstrates how to use it step by step, making the process easy for newcomers and useful for advanced users.
1.2 The Evaluation Framework
These are the main steps in the watsonx Orchestrate Evaluation Framework:
Define the Stories, which means what you want to achieve with an agent and its related tooling.
Generate synthetic Test Cases based on the given Stories
A synthetic Test Case contains:
Agent
Story
Starting sentence
Goals
Goal details
Run the Evaluation
Examine the Evaluation Results
The following image is a screenshot from the watsonx Orchestrate documentation (taken 2025-09-11) and documents the concept of the evaluation tooling in the watsonx Orchestrate ADK.
The following image displays the main components of the evaluation example.
Agent
Stories
Tools
Test Cases
Test Result
1.3 Key steps of the Evaluation Framework example
The GIF below illustrates the steps.
Set up the infrastructure
Clone the example repo, prepare environment variables, start everything with Docker Compose.
Set up watsonx Orchestrate ADK + local server
Create & activate a virtual environment, install the necessary CLI, configure the .env file for local or live usage, start/stop the server, and integrate with Galaxium services.
Create & import a local Python tool
Build the folder structure, write the Tool (making requests to the /flights endpoint), test it locally, freeze its dependencies, and import it into watsonx Orchestrate.
Add a connection
Define connections so the Agent can invoke backend REST services; set credentials (if needed).
Import an agent
Use a YAML configuration to define the Agent: which LLM, which Tool(s), instructions for behavior, etc. Import via CLI.
Generate synthetic test cases
Define “stories” in a CSV file (e.g. “What can you do for me?”, “How to get the available flights?”). Then use the ADK's evaluations generate command to produce JSON test cases based on stories + tools.
Run the evaluation
Pick one or more test case JSON files; run the orchestrate evaluations evaluate command; gather results (Tool invocations, summaries, matching vs. goals).
Inspect & interpret results
Review snapshot files, goal details, output summaries. Compare expected vs actual behavior. Identify mismatches.
1.5 Set up a folder structure framework
Clone the following GitHub repository as a folder structure framework to run the example.
git clone https://github.com/thomassuedbroecker/galaxium_travels_evaluation_example
tree .
Step 7: Start the watsonx Orchestrate Development Edition Server again
We start the server with Langfuse for monitoring the local LangGraph agents. For this, we use the generated and configured server.env and docker-compose.yml files.
Start server
cd watsonx-orchestrate-adk
source ./.venv/bin/activate
export SERVER_ENVIRONMENT=server.env
export DOCKER_COMPOSE_FILE=docker-compose.yml
orchestrate server start -e ${SERVER_ENVIRONMENT} -f ${DOCKER_COMPOSE_FILE} --with-langfuse
Start the chat lite UI
orchestrate chat start
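When you are done, the chat UI and the server can be stopped again. The stop commands below mirror the start commands (verify them against your ADK version):
orchestrate chat stop
orchestrate server stop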
✅ Done! We now have:
The ADK installed
The development server running
Both systems (backend + ADK) active in parallel
Next, we’ll create a local Python Tool to connect the Agent with the backend.
4. Create and import a local Python Tool
Now that the watsonx Orchestrate Development Edition server and the Galaxium Travels backend are running, we’ll build a Python Tool that lets our Agent fetch flight information.
A Tool is basically:
A Python function that performs an action (e.g., call a REST API)
Registered so the Agent can discover and use it
Step 1: Set up the following folder structure
cd watsonx-orchestrate-adk
source ./.venv/bin/activate
mkdir tools
mkdir tools/getFlights_tool_local
mkdir tools/getFlights_tool_local/source
cd tools/getFlights_tool_local/source
touch getFlights_tool_localhost.py
touch requirements.txt
tree .
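The Tool's only third-party dependency is the requests library. Add it to the requirements.txt file; the unpinned entry below is a minimal assumption, and freezing the exact version you installed is the safer choice:
echo "requests" > requirements.txt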
The Tool invokes the REST API endpoint to retrieve all flights from the booking backend.
from ibm_watsonx_orchestrate.agent_builder.tools import tool, ToolPermission
import requests

@tool(name="getFlights_tool_localhost",
      description="Retrieve a list of all available flights, including origin, destination, departure and arrival times, price, and the number of seats currently available for booking.",
      permission=ToolPermission.ADMIN)
def getFlights_tool_localhost() -> dict:
    """Retrieve a list of all available flights, including origin, destination,
    departure and arrival times, price, and the number of seats currently
    available for booking.

    Returns:
        dict: The list of available flights, or an error description.
    """
    url = "http://localhost:8082/flights"  # outside compose
    # url = "http://booking_system:8082/flights"  # inside compose
    headers = {
        "Content-Type": "application/json",
    }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
        # response.text is a property, not a method; return the error
        # as a plain dict so the annotated return type still holds.
        return {"error": response.text}
    except Exception as e:
        return {"error": str(e)}
Add the following code to the end of the tool file to test the Tool locally. This code must be commented out later, when we import the Tool into watsonx Orchestrate. For now, we simply execute the Tool as a local Python application.
# Test the tool
print(f"Tool output: {getFlights_tool_localhost()}")
Navigate back to the watsonx-orchestrate-adk folder
cd ../../../
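Before importing, comment out the local test code again. Then register the Tool with the development server via the ADK CLI; a minimal sketch (verify the flags against your ADK version):
orchestrate tools import -k python -f ./tools/getFlights_tool_local/source/getFlights_tool_localhost.py -r ./tools/getFlights_tool_local/source/requirements.txt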
✅ Done! We now have:
Python Tool (getFlights_tool_localhost)
Connected to the Galaxium Travels backend
Visible in the ADK development server
Next, we’ll import the Agent and connect it with this Tool.
5. Add a connection
In watsonx Orchestrate, we can add a connection to external systems using various authentication options. A connection defines the URL of and the authentication to the external system, which the Tools can use later.
In the image below you see the Tool and the Connection configured to connect to the backend inside compose.
Step 1: Create a connection
cd watsonx-orchestrate-adk
cd tools/getFlights_tool_local/source
export APPLICATION_ID="galaxium-travels-booking-backend"
orchestrate connections add --app-id ${APPLICATION_ID}
cd ../../../
Step 2: Configure the connection
Now we define that the team in watsonx Orchestrate can use the connection to connect to the external system.
At the moment our backend doesn’t have authentication, but when you invoke an unprotected REST API with basic authentication, the call is still executed. This step just shows how to set up a connection for the backend.
cd watsonx-orchestrate-adk
cd tools/getFlights_tool_local/source
export APPLICATION_ID="galaxium-travels-booking-backend"
export ENVIRONMENT="draft"
export SERVICE_USERNAME="admin"
export SERVICE_PASSWORD="admin"
orchestrate connections set-credentials --app-id ${APPLICATION_ID} --environment ${ENVIRONMENT} --username ${SERVICE_USERNAME} --password ${SERVICE_PASSWORD}
cd ../../../
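Note that the commands above set credentials but not the backend URL. Depending on your ADK version, the connection type, kind, and URL are defined in a separate configure step; a hedged sketch, where the URL assumes the backend is reachable on localhost:8082:
orchestrate connections configure --app-id ${APPLICATION_ID} --environment ${ENVIRONMENT} --type team --kind basic --url "http://localhost:8082"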
6. Import an Agent
With our Tool in place, the next step is to import an Agent into the ADK development server. This Agent will use the getFlights_tool_localhost Tool to answer requests about available flights.
Step 1: Set up the following folder structure
cd watsonx-orchestrate-adk
mkdir ./agents
touch ./agents/Galaxium_Travels_GetFlights_Agent.yaml
Step 2: Insert the following Agent configuration into the created file
Now we create a new Agent from an Agent configuration file, in this case a YAML file.
kind: native
name: Galaxium_Travels_GetFlights_Agent
display_name: Galaxium_Travels_GetFlights_Agent
description: This Agent manages all the required tasks a user would perform to list available flights.
context_access_enabled: true
context_variables: []
llm: watsonx/meta-llama/llama-3-2-90b-vision-instruct
style: default
instructions: |-
  You are Arthur, the AI Agent of the Galaxium Travels Company, and you are here to help users manage their travels through our Galaxy with the Galaxium Travels Company.
  Select the right available tool to fulfill the following tasks for a user who interacts with the Agent:
  * List available flights.
  Your output must always be well-formatted Markdown.
  You can use bold, bullet points, tabs, or numbering to structure the text so that it is easier to read for human readers when it makes sense.
  Use Markdown table and chart formatting, where possible, to convey the content to the user, but ensure you generate VALID content for the tables.
guidelines: []
collaborators: []
tools:
  - getFlights_tool_localhost
knowledge_base: []
chat_with_docs:
  enabled: false
  vector_index:
    chunk_size: 400
    chunk_overlap: 50
    limit: 10
  generation:
    prompt_instruction: ''
    max_docs_passed_to_llm: 5
    generated_response_length: Moderate
    display_text_no_results_found: I searched my knowledge base, but did not find anything related to your query
    display_text_connectivity_issue: I might have information related to your query to share, but am unable to connect to my knowledge base at the moment
    idk_message: I'm afraid I don't understand. Please rephrase your question.
  query_rewrite:
    enabled: true
  confidence_thresholds:
    retrieval_confidence_threshold: Lowest
    response_confidence_threshold: Lowest
  citations:
    citation_title: How do we know?
    citations_shown: -1
hap_filtering:
  output:
    enabled: false
    threshold: 0.5
starter_prompts:
  is_default_prompts: false
  prompts:
    - id: default0
      title: How can you help me?
      subtitle: How can you help me?
      prompt: How can you help me?
      state: active
    - id: default1
      title: What are the available flights?
      subtitle: What are the available flights?
      prompt: What are the available flights?
      state: active
welcome_content:
  welcome_message: Welcome! I am Arthur, your AI Agent, here to help you manage your travels through the Galaxy!
  description: Accuracy of generated answers may vary. Please double-check responses.
  is_default_message: false
spec_version: v1
Navigate back to the watsonx-orchestrate-adk folder
cd ../
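Step 3: Import the Agent
From the watsonx-orchestrate-adk folder, import the Agent configuration with the ADK CLI; a minimal sketch (verify the flag against your ADK version):
orchestrate agents import -f ./agents/Galaxium_Travels_GetFlights_Agent.yaml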
Step 4: List the Agents again
We need to ensure that we get the full name of the imported Agent. If needed, open a new terminal in the watsonx-orchestrate-adk folder and run the following commands.
cd watsonx-orchestrate-adk
source .venv/bin/activate
orchestrate agents list
In the output, we notice that the name has not changed during the import.
✅ Done! We now have:
The Python Tool (getFlights_tool_localhost)
An Agent (Galaxium_Travels_GetFlights_Agent) that knows how to use it
Both available in the ADK development server
Next, we’ll create Stories and generate Test Cases to evaluate the Agent.
7. Generate the synthetic test cases
Define Stories in a CSV file, where each row combines the initial question of a conversation with the Agent that should handle it
Generate synthetic Test Cases based on the given Stories.
Step 1: Set up the following folder structure
cd watsonx-orchestrate-adk
source .venv/bin/activate
mkdir ./evaluations
mkdir ./evaluations/stories
cd ./evaluations/stories
touch galaxium_travels_stories.csv
cd ../../
tree .
We define a starting point for a story for an interaction with the Agent. Stories are stored in a simple CSV file. Each row represents one user request.
Insert the content
echo "story,agent" >> ./evaluations/stories/galaxium_travels_stories.csv
echo "What can you do for me?,Galaxium_Travels_GetFlights_Agent" >> ./evaluations/stories/galaxium_travels_stories.csv
echo "How to get the available flights?,Galaxium_Travels_GetFlights_Agent" >> ./evaluations/stories/galaxium_travels_stories.csv
A synthetic test case is a configuration in JSON format; with it, we can execute an evaluation later.
In the JSON you can see the defined goals, which will be verified during the evaluation. In the following generated synthetic test case, the goals verify that the Agent invokes the tool (goal get_flights) and provides a summarization of the result (goal summarize). The given response will later be used to assert the execution result.
Here is an example of one of the generated test cases, containing the fields listed below:
Agent
Story
Starting sentence
Goals
Goal details
{
  "agent": "Galaxium_Travels_GetFlights_Agent",
  "story": "What can you do for me?",
  "starting_sentence": "I'd like to know what flights are available.",
  "goals": {
    "get_flights": [
      "summarize"
    ]
  },
  "goal_details": [
    {
      "type": "tool_call",
      "name": "get_flights",
      "tool_name": "getFlights_tool_localhost",
      "args": {}
    },
    {
      "type": "text",
      "name": "summarize",
      "response": "Here are the available flights: Earth to Mars, Earth to Moon, Mars to Earth, Venus to Earth, Jupiter to Europa, Earth to Venus, Moon to Mars, Mars to Jupiter, Europa to Earth, Earth to Pluto.",
      "keywords": [
        "Earth",
        "Mars",
        "Moon",
        "Venus",
        "Jupiter",
        "Europa",
        "Pluto"
      ]
    }
  ]
}
8. Run the evaluation
Now that we have our Test Cases (test_cases.json), we can run the Evaluation Framework to test the Agent and the Tool. This step checks if the Agent behaves as expected when executing the Stories.
Step 1: Create output directory for the test results
cd watsonx-orchestrate-adk
mkdir ./evaluations/test_results
Step 2: Define the resources for the evaluation
Now we can define which of the test cases we want to execute using environment variables.
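A hedged sketch of how this can look; the test case file name is hypothetical, and the evaluate flags should be verified against your ADK version:
export TEST_PATHS=./evaluations/test_cases/galaxium_test_case_1.json
export OUTPUT_DIR=./evaluations/test_results
orchestrate evaluations evaluate --test-paths ${TEST_PATHS} --output-dir ${OUTPUT_DIR}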
Next, we’ll explore how to inspect and analyze results in more detail.
9. Final thoughts
This example is simple, but that is exactly what makes it a good starting point for a basic understanding. It gives a clean end-to-end view of how story → tool → agent → test → evaluation works. Once comfortable, you can extend it with:
More complex tools
More elaborate stories (branching, error cases)
Higher diversity of test cases
Live monitoring + observability
If you follow through, you’ll gain confidence in how your AI agent behaves, catch issues early, and improve reliability.
In addition, here is a quick checklist for testing AI Agents with the watsonx Orchestrate ADK:
✅ Define clear user stories that include context, goal, and expected interaction
✅ Build or import tools & REST APIs your agent needs
✅ Generate synthetic test cases automatically from your stories
✅ Set up backend + local environment to test tools locally
✅ Use Evaluation Framework to simulate “trajectories” → compare actual behavior to ground-truth data
✅ Check for correct ordering of tool calls + correct input parameters
✅ Evaluate summary quality: final answers should be relevant, concise, accurate
✅ Use “ground truth datasets” with expected conversations, tool-calls, goal dependencies etc.
✅ Inspect evaluation results, logs, detect where errors occur → iterate to improve