MorganaBench JSONL format overview
This page teaches the MorganaBench JSONL format progressively, starting with a single line you can understand at a glance and gradually introducing more concepts.
This is only about the data format and the SDK’s Pydantic models.
If you are using Python, everything described in this tutorial is available as validated Pydantic models in this SDK; see the Python SDK section below.
The benchmark record format is aligned with MLflow-style evaluation
(inputs / expectations / outputs), and we also have an MLflow tutorial: Tutorial: OpenAI ADK + MLflow, end-to-end.
Before we start: JSONL format
A MorganaBench benchmark file is a JSONL file: each line is one self-contained JSON object. There is no outer [ … ]
array around the file.
In a real JSONL file, each example is one line.
In this tutorial, we format JSON in a pretty multi-line style for readability. In the actual benchmark JSONL that MorganaBench provides, and in the executed benchmark JSONL that you upload back, each example is still serialized as a single line.
Two terms we’ll use throughout:

- Benchmark JSONL: lines with inputs + expectations.
- Executed benchmark JSONL: the same lines, but with outputs populated after running your agent.
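As a quick illustration of the line-oriented format (standard-library json only, no SDK required), a JSONL file is parsed one line at a time; the file as a whole is not valid JSON, so each line must be decoded independently:

```python
import json

# Two benchmark lines, each a complete JSON object (illustrative content).
jsonl_text = (
    '{"inputs": {"messages": [{"role": "user", "content": "Q1"}]}, '
    '"expectations": {"expected_response": "A1"}}\n'
    '{"inputs": {"messages": [{"role": "user", "content": "Q2"}]}, '
    '"expectations": {"expected_response": "A2"}}\n'
)

# Parse line by line; json.loads() on the whole file would fail,
# because there is no outer [ ... ] array.
examples = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
```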
1) Hello world: one turn, no tools
We’ll start with the smallest useful benchmark example: a single user message and a “good answer” description.
Benchmark JSONL (provided by MorganaBench): inputs + expectations
{
"inputs": {
"messages": [{"role": "user", "content": "What is the capital of France?"}]
},
"expectations": {
"expected_response": "Paris is the capital of France."
}
}
What this says:

- inputs.messages is a chat transcript (OpenAI-style).
- expectations.expected_response is what we’ll compare the final answer to.
Executed benchmark JSONL: add outputs.response
An executed benchmark line looks the same, but includes what actually happened under outputs:
{
"inputs": {
"messages": [{"role": "user", "content": "What is the capital of France?"}]
},
"expectations": {
"expected_response": "Paris is the capital of France."
},
"outputs": {
"response": "Paris is the capital of France."
}
}
2) Add a trace for retrieval evaluation
So far, we can evaluate the final response text, but we cannot evaluate what documents were retrieved.
To evaluate retrieval quality, MorganaBench relies on outputs.trace: a list of events describing what happened while
producing the answer.
In the executed benchmark you upload back, include retrieval events that record “these are the chunks we retrieved”.
In this SDK, a retrieval event has:

- event: "retriever"
- outputs: a list of retrieved chunks, each with an application-specific id and page_content
Executed example with retrieval in the trace:
{
"inputs": {
"messages": [{"role": "user", "content": "What is the capital of France?"}]
},
"expectations": {
"expected_response": "Paris is the capital of France."
},
"outputs": {
"response": "Paris is the capital of France.",
"trace": [
{
"event": "retriever",
"outputs": [
{"id": "doc_1", "page_content": "Paris is the capital and most populous city of France."},
{"id": "doc_2", "page_content": "France's capital city is Paris."}
]
}
]
}
}
Two small but important details:

- The chunk ids are application-defined. They just need to be stable enough for your evaluation.
- Those chunk ids are also what citations reference later (see section 6).
3) Assertions about tool calls
A benchmark can also include assertions about what the agent should do, not just what it should say.
The SDK currently supports tool-call assertions in expectations.assertions. For example, a benchmark may assert that:
a specific tool was called
its parameters match specific criteria (exact match, free-text match, missing parameter, optional, etc.)
Example benchmark line with an assertion that search was called:
{
"inputs": {
"messages": [{"role": "user", "content": "Who is the King of England?"}],
},
"expectations": {
"expected_response": "King Charles III is the current monarch of the United Kingdom.",
"assertions": [
{
"assert_that": "tool_called",
"tool": "search",
"parameters": [
{"param": "query", "matcher": {"match_as": "free_text", "value": "King Charles III"}},
{"param": "limit", "matcher": {"match_as": "equality", "value": 5}},
{"param": "site", "matcher": {"match_as": "missing"}}
]
}
]
}
}
This is still a benchmark (unexecuted) line: it describes what is expected, but does not include any results yet.
Some assertions group parameters together. Those use params (a list) instead of param (a single name):
{
"inputs": {
"messages": [
{"role": "user", "content": "Schedule a design review next Friday at 2pm with Alex at HQ."}
]
},
"expectations": {
"expected_response": "Sure, I will schedule the design review.",
"assertions": [
{
"assert_that": "tool_called",
"tool": "calendar",
"parameters": [
{
"params": ["day", "month", "year", "hour", "minute"],
"matcher": {"match_as": "date_time", "value": "next Friday at 2pm"}
},
{
"params": ["title", "description"],
"matcher": {"match_as": "free_text", "value": "MorganaBench format design review"}
},
{"param": "attendees", "matcher": {"match_as": "equality", "value": "[email protected]"}}
]
}
]
}
}
Note on inputs.tools: benchmarks often omit this field. When it is present, it is just a restriction: out of all tools
available to the agent, use only this specific subset for this example.
4) To verify tool assertions, add tool events to the trace (executed benchmark)
Assertions only become checkable once the executed benchmark records the tool calls the agent actually made.
Tool-call assertions are verified against outputs.trace, which should contain events like:
{"event":"tool_call","id":"...","tool":"...","params":{...}}{"event":"tool_result","id":"...","result":...}
Executed example that includes tool-call trace:
{
"inputs": {
"messages": [{"role": "user", "content": "Who is the King of England?"}],
},
"expectations": {
"expected_response": "King Charles III is the current monarch of the United Kingdom.",
"assertions": [
{
"assert_that": "tool_called",
"tool": "search",
"parameters": [
{"param": "query", "matcher": {"match_as": "free_text", "value": "King Charles III"}},
{"param": "limit", "matcher": {"match_as": "equality", "value": 5}},
{"param": "site", "matcher": {"match_as": "missing"}}
]
}
]
},
"outputs": {
"response": "King Charles III is the current monarch of the United Kingdom.",
"trace": [
{
"event": "tool_call",
"id": "call_1",
"tool": "search",
"params": {"query": "King Charles III", "limit": 5}
},
{
"event": "tool_result",
"id": "call_1",
"result": {
"status": "ok",
"items": ["https://en.wikipedia.org/wiki/King_Charles_III"]
}
}
]
}
}
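To make verification concrete, here is a minimal sketch (plain dicts, not the SDK’s actual checker) of how the three matchers above could be applied to a tool_call event. The free_text branch is a deliberately naive stand-in; real free-text scoring is evaluator-defined:

```python
def check_param(params: dict, spec: dict) -> bool:
    """Check one parameter spec against the params of a tool_call event."""
    matcher = spec["matcher"]
    if matcher["match_as"] == "missing":
        return spec["param"] not in params
    if spec["param"] not in params:
        return False
    value = params[spec["param"]]
    if matcher["match_as"] == "equality":
        return value == matcher["value"]
    if matcher["match_as"] == "free_text":
        # Naive stand-in for real free-text matching.
        return matcher["value"].lower() in str(value).lower()
    raise ValueError(f"unknown matcher: {matcher['match_as']}")

trace = [
    {"event": "tool_call", "id": "call_1", "tool": "search",
     "params": {"query": "King Charles III", "limit": 5}},
]
assertion = {
    "assert_that": "tool_called",
    "tool": "search",
    "parameters": [
        {"param": "query", "matcher": {"match_as": "free_text", "value": "King Charles III"}},
        {"param": "limit", "matcher": {"match_as": "equality", "value": 5}},
        {"param": "site", "matcher": {"match_as": "missing"}},
    ],
}

# The assertion passes if ANY matching tool_call satisfies ALL parameter specs.
calls = [e for e in trace if e["event"] == "tool_call" and e["tool"] == assertion["tool"]]
ok = any(all(check_param(c["params"], s) for s in assertion["parameters"]) for c in calls)
```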
If a benchmark uses time-aware matchers (like "date_time"), include outputs.environment.user_time in the executed
benchmark so the matcher has a reference point:
{
"inputs": {
"messages": [
{"role": "user", "content": "Schedule a design review next Friday at 2pm with Alex at HQ."}
]
},
"expectations": {
"expected_response": "Sure, I will schedule the design review.",
"assertions": [
{
"assert_that": "tool_called",
"tool": "calendar",
"parameters": [
{
"params": ["day", "month", "year", "hour", "minute"],
"matcher": {"match_as": "date_time", "value": "next Friday at 2pm"}
},
{
"params": ["title", "description"],
"matcher": {"match_as": "free_text", "value": "MorganaBench format design review"}
},
{"param": "attendees", "matcher": {"match_as": "equality", "value": "[email protected]"}}
]
}
]
},
"outputs": {
"response": "I've scheduled the design review for next Friday at 2pm.",
"environment": {"user_time": "2024-02-01T09:15:00"},
"trace": [
{
"event": "tool_call",
"id": "call_2",
"tool": "calendar",
"params": {
"day": 9, "month": 2, "year": 2024, "hour": 14, "minute": 0,
"title": "MorganaBench format design review",
"attendees": "[email protected]"
}
},
{
"event": "tool_result",
"id": "call_2",
"result": {"status": "created", "event_id": "evt_123", "calendar": "primary"}
}
]
}
}
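The exact date_time matcher semantics belong to the SDK; this sketch only shows why user_time is needed as a reference point, under one plausible reading of “next Friday” as the Friday of the following week (which is what the example trace encodes):

```python
from datetime import datetime, timedelta

user_time = datetime.fromisoformat("2024-02-01T09:15:00")  # a Thursday

# "next Friday" read as the Friday of the following week.
days_ahead = (4 - user_time.weekday()) % 7 + 7  # weekday(): Monday=0 ... Friday=4
target = (user_time + timedelta(days=days_ahead)).replace(
    hour=14, minute=0, second=0, microsecond=0
)
# target now matches the trace: day=9, month=2, year=2024, hour=14, minute=0
```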
Notes:
- The id links a tool_call event to its corresponding tool_result. This aligns with how traces are extracted by iterating over agentic-loop events in frameworks such as LangGraph and the OpenAI Agents SDK. If you are extracting a trace from an observability tool, the tool call is usually already coupled with its result, and you are expected to split it into two trace events.
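For example, if your observability export gives you one coupled span per tool invocation (the span field names here are hypothetical, not from any particular tool), the split could look like:

```python
# Hypothetical coupled span from an observability export (field names assumed).
span = {
    "span_id": "call_1",
    "tool_name": "search",
    "input": {"query": "King Charles III", "limit": 5},
    "output": {"status": "ok", "items": ["https://en.wikipedia.org/wiki/King_Charles_III"]},
}

# Split into the two trace events described above, linked by the same id.
trace_events = [
    {"event": "tool_call", "id": span["span_id"], "tool": span["tool_name"],
     "params": span["input"]},
    {"event": "tool_result", "id": span["span_id"], "result": span["output"]},
]
```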
5) Multi-turn conversations: history vs the last turn
inputs.messages can also describe a multi-turn conversation. The SDK expects:
- The last message has role "user": this is the “current question”.
- Any earlier messages are treated as history (context).
Evaluation is expected to focus on the agent’s response to the last user message, while earlier turns provide context. When running your agent, you are expected to provide the entire history to your agent to simulate a conversation.
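A small sketch of the expected handling: identify the current question, but feed the entire transcript to your agent (run_agent below is a placeholder for your own integration code, not an SDK function):

```python
messages = [
    {"role": "user", "content": "I'm planning a 3-day trip to Paris. Give me a short itinerary."},
    {"role": "assistant", "content": "Day 1: Louvre + Seine. Day 2: Montmartre. Day 3: Versailles."},
    {"role": "user", "content": "Is Versailles inside Paris?"},
]

# The SDK's contract: the last message is the current user question.
assert messages[-1]["role"] == "user"
history, current_question = messages[:-1], messages[-1]

# Pass the ENTIRE transcript (history + current question) to your agent:
# response = run_agent(messages)   # run_agent: your own integration code
```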
Executed multi-turn example:
{
"inputs": {
"messages": [
{"role": "user", "content": "I'm planning a 3-day trip to Paris. Give me a short itinerary."},
{"role": "assistant", "content": "Day 1: Louvre + Seine. Day 2: Montmartre. Day 3: Versailles."},
{"role": "user", "content": "Is Versailles inside Paris?"}
]
},
"expectations": {
"expected_response": "No. Versailles is a separate city outside Paris, typically visited as a day trip."
},
"outputs": {
"response": "No. Versailles is a separate city outside of Paris.",
"trace": [
{
"event": "retriever",
"outputs": [
{
"id": "doc_geo_1",
"page_content": "Versailles is a commune in the Yvelines department, in the Ile-de-France region, about 17 km west of Paris."
}
]
}
]
}
}
6) Citations: how to represent them in executed benchmarks
To enable citation evaluation, include outputs.citations in the executed benchmark.
Each citation references:
- document_id: the id of a retrieved chunk that appeared in a "retriever" trace event
- span_from / span_to: character offsets into outputs.response (span_from inclusive, span_to exclusive)
Executed example with retrieval trace + one citation:
{
"inputs": {
"messages": [{"role": "user", "content": "What is the capital of France?"}]
},
"expectations": {
"expected_response": "Paris is the capital and most populous city of France."
},
"outputs": {
"response": "Paris is the capital and most populous city of France.",
"trace": [
{
"event": "retriever",
"outputs": [
{
"id": "doc_1",
"page_content": "Paris is the capital and most populous city of France. It serves as the cultural center..."
}
]
}
],
"citations": [{"document_id": "doc_1", "span_from": 0, "span_to": 54}]
}
}
Important validation rule (enforced by the SDK): every citations[].document_id must match a retrieval outputs[].id that
appears somewhere in outputs.trace.
For example, if your response is exactly "Paris is the capital and most populous city of France.", then span_from=0 and
span_to=54 cover the entire string.
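Both rules are easy to check yourself before uploading. A minimal sketch over plain dicts (not the SDK’s validator):

```python
outputs = {
    "response": "Paris is the capital and most populous city of France.",
    "trace": [
        {"event": "retriever", "outputs": [
            {"id": "doc_1",
             "page_content": "Paris is the capital and most populous city of France."},
        ]},
    ],
    "citations": [{"document_id": "doc_1", "span_from": 0, "span_to": 54}],
}

# Rule 1: every cited document_id must appear among retrieved chunk ids in the trace.
retrieved_ids = {
    chunk["id"]
    for event in outputs["trace"]
    if event["event"] == "retriever"
    for chunk in event["outputs"]
}
assert all(c["document_id"] in retrieved_ids for c in outputs["citations"])

# Rule 2: span_from is inclusive, span_to exclusive, so slicing recovers the cited text.
c = outputs["citations"][0]
cited_text = outputs["response"][c["span_from"]:c["span_to"]]
```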
Python SDK: load and write JSONL with Example
In Python, each line is parsed as an Example:
from mb.entities import Example

def load_jsonl(path: str) -> list[Example]:
    with open(path, "r", encoding="utf-8") as f:
        # One Example per line; skip blank lines such as a trailing newline.
        return [Example.model_validate_json(line) for line in f if line.strip()]

def write_jsonl(path: str, examples: list[Example]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            # exclude_none keeps benchmark lines free of a null "outputs" field.
            f.write(ex.model_dump_json(exclude_none=True))
            f.write("\n")
This is the same mechanism used throughout the docs: benchmark JSONL is “just” Example without outputs, and executed
benchmark JSONL is Example with outputs populated.
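Since the only difference between the two files is the presence of outputs, you can partition a mixed file with a plain-dict check (a sketch; with the SDK-parsed Example you would presumably inspect its outputs field instead):

```python
import json

lines = [
    '{"inputs": {"messages": [{"role": "user", "content": "Q"}]}, '
    '"expectations": {"expected_response": "A"}}',
    '{"inputs": {"messages": [{"role": "user", "content": "Q"}]}, '
    '"expectations": {"expected_response": "A"}, "outputs": {"response": "A"}}',
]

records = [json.loads(line) for line in lines]
# Executed lines carry "outputs"; benchmark (unexecuted) lines do not.
executed = [r for r in records if "outputs" in r]
pending = [r for r in records if "outputs" not in r]
```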
Additional resources
- The full JSON Schema for Example: Schema and examples (non-Python usage). Useful if you prefer a “complete picture” view, or you’re implementing a validator in a different language.
- The end-to-end integration tutorial (includes trace building): Tutorial: OpenAI ADK + MLflow, end-to-end