Tutorial: OpenAI ADK + MLflow, end-to-end
In this tutorial we will learn how to use this SDK to evaluate agents. To that end, we will use:

- OpenAI ADK (Agents SDK) as the “system under test” (the agent under evaluation).
- MLflow GenAI evaluation to log evaluation runs and results.
We will start with the simplest possible workflow (run a JSONL benchmark locally and write only the final response), then incrementally add MLflow dataset management, and finally add a MorganaBench-compatible trace.
Install dependencies
Dependencies can be installed either with uv (recommended for this repository) or with pip.
Using uv
```shell
uv add openai-agents "mlflow[genai]"
```
Using pip
```shell
pip install openai-agents "mlflow[genai]"
```
1) Local: load benchmark JSONL, run an OpenAI ADK agent, write only outputs.response
The examples below assume you already have a MorganaBench benchmark JSONL file (inputs + expectations) to run.
This repository includes one at docs/examples/example-benchmark.jsonl.
This is the smallest end-to-end loop:
1. Read a MorganaBench benchmark JSONL (`inputs` + `expectations`).
2. For each example, run the agent under evaluation.
3. Write an executed benchmark JSONL (`inputs` + `expectations` + `outputs`), where `outputs` contains only `{"response": "..."}`.
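For orientation, a single benchmark line might look like the record below. The field names inside `inputs` and `expectations` are illustrative assumptions, not the exact MorganaBench schema; only the top-level `inputs` / `expectations` / `outputs` split is taken from this tutorial.

```json
{"inputs": {"messages": [{"role": "user", "content": "What is 2 + 2?"}]},
 "expectations": {"expected_response": "4"},
 "outputs": {"response": "2 + 2 equals 4."}}
```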
Example script
This script reads an unexecuted benchmark JSONL and writes an executed JSONL with outputs.response.
```python
import os

from agents import Agent, Runner  # <-- OpenAI ADK (Agents SDK)

from mb.entities import Example, Outputs


def run_benchmark(input_jsonl: str, output_jsonl: str) -> None:
    agent = Agent(
        name="TutorialAgent",
        instructions="You are a helpful assistant.",
    )
    # Ensure the output directory (e.g. out/) exists before opening the file.
    os.makedirs(os.path.dirname(output_jsonl) or ".", exist_ok=True)
    with (
        open(input_jsonl, "r", encoding="utf-8") as fin,
        open(output_jsonl, "w", encoding="utf-8") as fout,
    ):
        for line in fin:
            example = Example.model_validate_json(line)
            result = Runner.run_sync(agent, example.inputs.message_dicts())
            example.outputs = Outputs(response=str(result.final_output))
            fout.write(example.model_dump_json())
            fout.write("\n")


if __name__ == "__main__":
    run_benchmark(
        input_jsonl="docs/examples/example-benchmark.jsonl",
        output_jsonl="out/example-executed.jsonl",
    )
```
After running it, out/example-executed.jsonl can be uploaded to MorganaBench via the UI for evaluation and tracking.
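Before uploading, a quick sanity check that every record actually carries a non-empty `outputs.response` can save a round-trip. A minimal sketch (the helper name is ours, not part of either SDK, and it inspects raw JSON rather than the `Example` model):

```python
import json


def missing_responses(jsonl_text: str) -> list[int]:
    """Return 1-based line numbers whose record lacks a non-empty outputs.response."""
    bad: list[int] = []
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        record = json.loads(line)
        response = (record.get("outputs") or {}).get("response")
        if not response:
            bad.append(lineno)
    return bad


# Tiny self-contained check: the second record has an empty outputs object.
sample = "\n".join([
    json.dumps({"inputs": {}, "outputs": {"response": "4"}}),
    json.dumps({"inputs": {}, "outputs": {}}),
])
print(missing_responses(sample))  # → [2]
```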
2) MLflow: upload the benchmark, run the OpenAI ADK agent from predict_fn (no trace)
In this section, we will:
1. Load a MorganaBench JSONL.
2. Create an MLflow Evaluation Dataset and upload the records.
3. Run `mlflow.genai.evaluate(...)`, where MLflow calls `predict_fn` once per record.
Start MLflow with a SQL backend (SQLite)
Evaluation Datasets are stored in the tracking backend store; therefore, a SQL backend should be used. A minimal local setup is SQLite:
```shell
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlruns \
  --host 127.0.0.1 --port 5000
```
Create the dataset and evaluate
For this tutorial, we will run an evaluation with the Correctness scorer. Our dataset format is compatible with it.
```python
import json

import mlflow
from agents import Agent, Runner
from mlflow.genai.datasets import create_dataset
from mlflow.genai.scorers import Correctness


def load_jsonl(path: str) -> list[dict]:
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]


agent = Agent(name="TutorialAgent", instructions="Act as a helpful assistant.")


def predict_fn(
    messages: list[dict],
    tools: list[str] | None = None,
    metadata: dict | None = None,
) -> dict:
    result = Runner.run_sync(agent, messages)
    return {"response": str(result.final_output)}


if __name__ == "__main__":
    mlflow.set_tracking_uri("http://127.0.0.1:5000")
    mlflow.set_experiment("morgana-bench-tutorial")

    benchmark = load_jsonl("docs/examples/example-benchmark.jsonl")

    dataset = create_dataset(
        name="morgana_benchmark_tutorial_v1",
        experiment_id=["0"],  # default experiment; adjust as needed
    )
    dataset.merge_records(benchmark)

    mlflow.genai.evaluate(
        data=dataset,
        predict_fn=predict_fn,
        scorers=[Correctness()],
    )
```
At this point, the following artifacts are available:
- a versioned dataset in MLflow (enabling consistent re-runs of evaluations), and
- an evaluation run that stores the agent outputs (as `{"response": ...}`).
3) Add tracing: modify predict_fn to emit a MorganaBench trace
MorganaBench’s outputs.trace is a list of structured events (tool calls, tool results, retrieval results).
In the SDK these are discriminated by the event key (for example: "tool_call", "tool_result", "retriever").
We will derive a trace from the OpenAI ADK runner’s streamed run items:
- `tool_call_item` indicates that a tool was invoked and includes the call identifier and arguments.
- `tool_call_output_item` provides the output value returned by the tool.
Trace-building predict_fn (from streamed run events)
This is an async predict_fn so that we can consume the streaming iterator. MLflow supports asynchronous predict functions.
```python
import json

from agents import Agent, Runner

agent = Agent(name="TutorialAgent", instructions="Act as a helpful assistant.")


async def predict_fn(messages: list[dict]) -> dict:
    trace: list[dict] = []
    run = Runner.run_streamed(agent, messages)
    async for event in run.stream_events():
        if event.type != "run_item_stream_event":
            continue
        item = event.item
        match item.type:
            case "tool_call_item":
                raw = item.raw_item.model_dump(exclude_unset=True)
                params = json.loads(raw.get("arguments", "{}"))
                trace.append({
                    "event": "tool_call",
                    "id": raw["call_id"],
                    "tool": raw["name"],
                    "params": params,
                })
            case "tool_call_output_item":
                raw = item.raw_item.model_dump(exclude_unset=True)
                trace.append({
                    "event": "tool_result",
                    "id": raw["call_id"],
                    "result": item.output,
                })
    return {"response": str(run.final_output), "trace": trace}
```
What is included (and what is not)
- Included: a consistent tool-call / tool-result sequence suitable for MorganaBench assertions such as “a tool was called” and “with these parameters”.
- Not included automatically: `"retriever"` events and stable document IDs, unless tool outputs carry that information (or a small post-processing step converts tool outputs into retrieval events).
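As a sketch of such a post-processing step: suppose one of the agent's tools is a retrieval tool that returns a JSON list of documents with `id` fields, and suppose a `"retriever"` event carries a `documents` list. Both shapes, the tool name `search_docs`, and the helper itself are assumptions for illustration, not the MorganaBench schema:

```python
import json


def add_retriever_events(trace: list[dict], retrieval_tools: set[str]) -> list[dict]:
    """Append a 'retriever' event after each tool_result whose originating
    tool_call used one of the given retrieval tools and returned JSON documents."""
    # Map call ids to tool names so results can be matched to their calls.
    call_tools = {e["id"]: e["tool"] for e in trace if e["event"] == "tool_call"}
    enriched: list[dict] = []
    for event in trace:
        enriched.append(event)
        if event["event"] != "tool_result":
            continue
        if call_tools.get(event["id"]) not in retrieval_tools:
            continue
        try:
            docs = json.loads(event["result"])
        except (TypeError, ValueError):
            continue  # tool output was not JSON; leave the trace unchanged
        if isinstance(docs, list):
            enriched.append({
                "event": "retriever",
                "id": event["id"],
                "documents": [d.get("id") for d in docs if isinstance(d, dict)],
            })
    return enriched


# Example trace produced by a hypothetical search tool.
trace = [
    {"event": "tool_call", "id": "c1", "tool": "search_docs", "params": {"q": "mlflow"}},
    {"event": "tool_result", "id": "c1", "result": '[{"id": "doc-7", "text": "..."}]'},
]
print(add_retriever_events(trace, {"search_docs"}))
```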
This tutorial provides an incremental path from “local JSONL runner” -> “MLflow dataset + evaluation run” -> “MorganaBench trace-enabled outputs”, all driven by an OpenAI ADK agent.