Tutorial: OpenAI ADK + MLflow, end-to-end

In this tutorial, we will learn how to use this SDK to evaluate agents. To that end, we will use:

  • OpenAI ADK (Agents SDK) as the “system under test” (the agent under evaluation).

  • MLflow GenAI evaluation to log evaluation runs and results.

We will start with the simplest possible workflow (run a JSONL benchmark locally and write only the final response), then incrementally add MLflow dataset management, and finally add a MorganaBench-compatible trace.

Install dependencies

Dependencies can be installed either with uv (recommended for this repository) or with pip.

Using uv

uv add openai-agents "mlflow[genai]"

Using pip

pip install openai-agents "mlflow[genai]"

1) Local: load benchmark JSONL, run an OpenAI ADK agent, write only outputs.response

The examples below assume you already have a MorganaBench benchmark JSONL file (inputs + expectations) to run. This repository includes one at docs/examples/example-benchmark.jsonl.

This is the smallest end-to-end loop:

  1. Read a MorganaBench benchmark JSONL (inputs + expectations).

  2. For each example, run the agent under evaluation.

  3. Write an executed benchmark JSONL (inputs + expectations + outputs), where outputs contains only:

{"response": "..."}
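To make step 3 concrete, the sketch below builds one executed-benchmark record in memory. The inner field shapes (a messages list under inputs, an expected_response under expectations) are illustrative assumptions; only the top-level inputs/expectations/outputs split and the "outputs holds only response" rule come from the steps above.

```python
import json

# Hypothetical executed-benchmark record; the inner shapes of "inputs"
# and "expectations" are invented for illustration.
record = {
    "inputs": {"messages": [{"role": "user", "content": "What is 2 + 2?"}]},
    "expectations": {"expected_response": "4"},
    "outputs": {"response": "2 + 2 equals 4."},
}

line = json.dumps(record)  # one JSON object per line in the executed JSONL
assert set(json.loads(line)["outputs"]) == {"response"}
```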

Example script

This script reads an unexecuted benchmark JSONL and writes an executed JSONL with outputs.response.

from agents import Agent, Runner  # <-- OpenAI ADK (Agents SDK)
from mb.entities import Example, Outputs

def run_benchmark(input_jsonl: str, output_jsonl: str) -> None:
    agent = Agent(
        name="TutorialAgent",
        instructions="You are a helpful assistant.",
    )

    with (
        open(input_jsonl, "r", encoding="utf-8") as fin,
        open(output_jsonl, "w", encoding="utf-8") as fout,
    ):
        for line in fin:
            example = Example.model_validate_json(line)
            result = Runner.run_sync(agent, example.inputs.message_dicts())
            example.outputs = Outputs(response=str(result.final_output))

            fout.write(example.model_dump_json())
            fout.write("\n")


if __name__ == "__main__":
    run_benchmark(
        input_jsonl="docs/examples/example-benchmark.jsonl",
        output_jsonl="out/example-executed.jsonl",
    )

After running it, out/example-executed.jsonl can be uploaded to MorganaBench via the UI for evaluation and tracking.
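Before uploading, it can be useful to sanity-check the executed file. A minimal sketch, based only on the outputs.response rule above (check_executed is a helper defined here, not part of MorganaBench):

```python
import json

def check_executed(lines) -> bool:
    """Return True if every executed-benchmark line has a non-empty outputs.response."""
    for line in lines:
        record = json.loads(line)
        response = record.get("outputs", {}).get("response")
        if not isinstance(response, str) or not response:
            return False
    return True

# In practice: check_executed(open("out/example-executed.jsonl", encoding="utf-8"))
sample = ['{"inputs": {}, "expectations": {}, "outputs": {"response": "4"}}']
assert check_executed(sample)
```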


2) MLflow: upload the benchmark, run the OpenAI ADK agent from predict_fn (no trace)

In this section, we will:

  1. Load a MorganaBench JSONL.

  2. Create an MLflow Evaluation Dataset and upload the records.

  3. Run mlflow.genai.evaluate(...), where MLflow calls predict_fn once per record.

Start MLflow with a SQL backend (SQLite)

Evaluation Datasets are stored in the tracking backend store; therefore, a SQL backend should be used. A minimal local setup is SQLite:

mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlruns \
  --host 127.0.0.1 --port 5000

Create the dataset and evaluate

For this tutorial, we will run an evaluation with MLflow's built-in Correctness scorer; our dataset's inputs and expectations fields match the format that scorer expects.

import json

import mlflow
from mlflow.genai.datasets import create_dataset
from mlflow.genai.scorers import Correctness

from agents import Agent, Runner


def load_jsonl(path: str) -> list[dict]:
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]


agent = Agent(name="TutorialAgent", instructions="Act as a helpful assistant.")


def predict_fn(
    messages: list[dict],
    tools: list[str] | None = None,
    metadata: dict | None = None,
) -> dict:
    result = Runner.run_sync(agent, messages)
    return {"response": str(result.final_output)}


if __name__ == "__main__":
    mlflow.set_tracking_uri("http://127.0.0.1:5000")
    mlflow.set_experiment("morgana-bench-tutorial")

    benchmark = load_jsonl("docs/examples/example-benchmark.jsonl")

    dataset = create_dataset(
        name="morgana_benchmark_tutorial_v1",
        experiment_id=["0"],  # experiment ID(s) to attach the dataset to; "0" is the default experiment, so adjust as needed
    )
    dataset.merge_records(benchmark)

    mlflow.genai.evaluate(
        data=dataset,
        predict_fn=predict_fn,
        scorers=[Correctness()],
    )

At this point, the following artifacts are available:

  • a versioned dataset in MLflow (enabling consistent re-runs of evaluations), and

  • an evaluation run that stores the agent outputs (as {"response": ...}).


3) Add tracing: modify predict_fn to emit a MorganaBench trace

MorganaBench’s outputs.trace is a list of structured events (tool calls, tool results, retrieval results). In the SDK these are discriminated by the event key (for example: "tool_call", "tool_result", "retriever").

We will derive a trace from the OpenAI ADK runner’s streamed run items:

  • tool_call_item indicates that a tool was invoked and includes the call identifier and arguments.

  • tool_call_output_item provides the output value returned by the tool.
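Under this mapping, a single tool invocation yields a tool_call event followed by a matching tool_result event that shares the same ID. A sketch of the resulting outputs.trace (the tool name, arguments, and call ID are invented for illustration):

```python
# Hypothetical trace produced by mapping the two run-item types above.
trace = [
    {
        "event": "tool_call",
        "id": "call_abc123",
        "tool": "get_weather",
        "params": {"city": "Paris"},
    },
    {
        "event": "tool_result",
        "id": "call_abc123",
        "result": "18°C and sunny",
    },
]

# Each tool_result should refer back to a prior tool_call via the shared id.
call_ids = {e["id"] for e in trace if e["event"] == "tool_call"}
assert all(e["id"] in call_ids for e in trace if e["event"] == "tool_result")
```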

Trace-building predict_fn (from streamed run events)

This is an async predict_fn so that we can consume the streaming iterator. MLflow supports asynchronous predict functions.

import json

from agents import Agent, Runner


agent = Agent(name="TutorialAgent", instructions="Act as a helpful assistant.")


async def predict_fn(messages: list[dict]) -> dict:
    trace: list[dict] = []

    run = Runner.run_streamed(agent, messages)

    async for event in run.stream_events():
        if event.type != "run_item_stream_event":
            continue

        item = event.item

        match item.type:
            case "tool_call_item":
                raw = item.raw_item.model_dump(exclude_unset=True)
                params = json.loads(raw.get("arguments") or "{}")  # tolerate missing/empty arguments
                trace.append({
                    "event": "tool_call",
                    "id": raw["call_id"],
                    "tool": raw["name"],
                    "params": params,
                })
            case "tool_call_output_item":
                raw = item.raw_item.model_dump(exclude_unset=True)
                trace.append({
                    "event": "tool_result",
                    "id": raw["call_id"],
                    "result": item.output,
                })

    return {"response": str(run.final_output), "trace": trace}

What is included (and what is not)

  • Included: a consistent tool-call / tool-result sequence suitable for MorganaBench assertions such as “a tool was called” and “with these parameters”.

  • Not included automatically: "retriever" events and stable document IDs unless tool outputs carry that information (or a small post-processing step is added to convert tool outputs into retrieval events).
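As an example of such a post-processing step, the sketch below promotes tool results from a known retrieval tool into "retriever" events. The tool name search_docs, the {"documents": [...]} result shape, and the retriever event fields are all assumptions for illustration; check the MorganaBench SDK schema for the exact retriever event format.

```python
def tool_results_to_retriever_events(trace, retrieval_tools=("search_docs",)):
    """Rewrite tool_result events from retrieval tools as "retriever" events.

    Assumes (hypothetically) that retrieval tools return {"documents": [{"id": ...}, ...]}.
    """
    retrieval_ids = {
        e["id"] for e in trace
        if e["event"] == "tool_call" and e["tool"] in retrieval_tools
    }
    converted = []
    for e in trace:
        if e["event"] == "tool_result" and e["id"] in retrieval_ids:
            docs = e["result"].get("documents", [])
            converted.append({
                "event": "retriever",
                "id": e["id"],
                "documents": [d.get("id") for d in docs],
            })
        else:
            converted.append(e)  # pass non-retrieval events through unchanged
    return converted
```

This keeps the trace-building predict_fn unchanged and isolates the retrieval-specific assumptions in one place.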

This tutorial provides an incremental path from “local JSONL runner” -> “MLflow dataset + evaluation run” -> “MorganaBench trace-enabled outputs”, all driven by an OpenAI ADK agent.