Platform Capabilities

Grading Continuity

Deterministic input/output regression tests can't capture how an agent reasons, adapts, and recovers. Pipelines grades the behavior behind the outcome:

Task Completion: Given a goal and a set of tools, did the agent actually get the job done? Determine Accuracy.
Tool Use: With APIs and live MCP connections on hand, did it reach for the right tool at the right moment, and retain the context that mattered? Understand Trajectory Quality.
Failure Handling: A dependency misfires; an API stalls. When the environment turns unpredictable, does the agent recover gracefully? Measure Resilience.
Complex System Behavior: Dispatch a fleet of agents together. Do they carry context forward, coordinate, and reach the shared objective? Observe System-level Behavior.

Key capabilities

Register and version agents

Agent Library: Register external HTTP agents or in-platform code/sandbox agents, declare tool schemas, and configure per-run timeout and concurrency.
Agent versioning: Track material config/tool updates as new versions so run history stays reproducible.
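
As an illustration of why versioning on material config changes keeps run history reproducible, here is a minimal sketch in Python. It is not the Pipelines API: `AgentRegistry`, `config_fingerprint`, and every config field name (`kind`, `endpoint`, `tools`, `timeout_s`, `max_concurrency`) are hypothetical, chosen only to mirror the bullets above. The sketch assumes a new version is cut whenever the hashed config differs from the latest stored one.

```python
import copy
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Return a short, stable hash of an agent config (key order does not matter)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


class AgentRegistry:
    """Hypothetical in-memory registry; an illustration, not the Pipelines API."""

    def __init__(self) -> None:
        # agent name -> list of (version number, fingerprint, config snapshot)
        self._history: dict[str, list[tuple[int, str, dict]]] = {}

    def register(self, name: str, config: dict) -> int:
        """Store the config and return its version; an unchanged config reuses the latest."""
        history = self._history.setdefault(name, [])
        fingerprint = config_fingerprint(config)
        if history and history[-1][1] == fingerprint:
            return history[-1][0]
        version = len(history) + 1
        history.append((version, fingerprint, copy.deepcopy(config)))
        return version

    def get(self, name: str, version: int) -> dict:
        """Return the exact config snapshot that a past run used."""
        return copy.deepcopy(self._history[name][version - 1][2])


registry = AgentRegistry()
config = {
    "kind": "http",  # assumed values: "http" or "sandbox"
    "endpoint": "https://agent.example.com/run",  # placeholder URL
    "tools": [{"name": "lookup_order", "parameters": {"order_id": "string"}}],
    "timeout_s": 60,
    "max_concurrency": 4,
}
v1 = registry.register("support-agent", config)
v1_again = registry.register("support-agent", dict(config))  # unchanged config
config["tools"].append({"name": "refund", "parameters": {"order_id": "string"}})
v2 = registry.register("support-agent", config)  # tool added, so a new version
```

Because each version stores a deep copy of its config, a run recorded against version 1 can still be replayed with the tool set it originally had, even after later edits.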

Simulate and stress test behavior

Odyssey execution modes: Run tools in sandbox simulation, in passthrough mode against live services, or in failure-injected mode to test recovery paths.
Task seeding: Author each scenario by specifying its instruction, behavior instructions, initial state, failure rules, expected outcome, and tracked constraints.
Synthetic generation grounded in real data: Generate synthetic seed tasks in bulk, then ground generation with dataset-backed synthetic profiles so scenarios reflect real operating patterns.
Model controls: Select simulator and judge models independently, with org-level defaults and BYOK model credentials.
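
To make the relationship between a seeded task and the three execution modes concrete, the following Python sketch models both. Everything in it is an assumption made for illustration: the `TaskSeed` fields simply restate the scenario parts listed above, and `ExecutionMode`, `FailureRule`, and `ToolRunner` are hypothetical names, not part of Pipelines or Odyssey. The sketch also assumes that, in failure-injected mode, calls that match no rule fall back to simulation.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class ExecutionMode(Enum):
    SIMULATED = "simulated"
    PASSTHROUGH = "passthrough"
    FAILURE_INJECTED = "failure_injected"


@dataclass
class FailureRule:
    tool: str     # tool name the rule applies to
    on_call: int  # 1-based call count at which the failure fires
    error: str    # error message returned to the agent


@dataclass
class TaskSeed:
    instruction: str
    behavior_instructions: str
    initial_state: dict
    expected_outcome: str
    failure_rules: list = field(default_factory=list)
    tracked_constraints: list = field(default_factory=list)


class ToolRunner:
    """Hypothetical dispatcher that tags every tool call with its provenance."""

    def __init__(self, mode, simulators, live_tools=None, failure_rules=()):
        self.mode = mode
        self.simulators = simulators
        self.live_tools = live_tools or {}
        self.failure_rules = list(failure_rules)
        self._calls = defaultdict(int)  # tool name -> number of calls so far

    def call(self, tool, **args):
        self._calls[tool] += 1
        if self.mode is ExecutionMode.FAILURE_INJECTED:
            for rule in self.failure_rules:
                if rule.tool == tool and rule.on_call == self._calls[tool]:
                    return {"tool": tool, "provenance": "injected", "error": rule.error}
        if self.mode is ExecutionMode.PASSTHROUGH:
            handler, provenance = self.live_tools[tool], "passthrough"
        else:
            handler, provenance = self.simulators[tool], "simulated"
        try:
            return {"tool": tool, "provenance": provenance, "result": handler(**args)}
        except Exception as exc:  # an unplanned failure, as opposed to an injected one
            return {"tool": tool, "provenance": "error", "error": str(exc)}


seed = TaskSeed(
    instruction="Refund order 1042 if it has not shipped.",
    behavior_instructions="Confirm the order status before refunding.",
    initial_state={"orders": {"1042": "processing"}},
    expected_outcome="Order 1042 is refunded exactly once.",
    failure_rules=[FailureRule(tool="get_status", on_call=1, error="503 Service Unavailable")],
    tracked_constraints=["refund is called at most once"],
)
runner = ToolRunner(
    ExecutionMode.FAILURE_INJECTED,
    simulators={"get_status": lambda order_id: seed.initial_state["orders"][order_id]},
    failure_rules=seed.failure_rules,
)
first = runner.call("get_status", order_id="1042")   # hits the injected failure
second = runner.call("get_status", order_id="1042")  # retry is served by the simulator
```

An agent under test would see the first call fail and, if it recovers well, retry and receive the simulated status on the second call.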

Inspect runs with evidence

Trajectory visibility: Inspect tool-call timelines, arguments, responses, and per-call provenance (simulated, injected, passthrough, or error).
Judge + mechanical metrics: Review semantic pass/fail verdicts alongside structural metrics such as completion, schema compliance, and consistency.
State inspection: Analyze ledger/state transitions and final run artifacts to understand why a run passed or failed.
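
The structural metrics mentioned above can be pictured as plain functions over a recorded trajectory. The Python sketch below is an assumption-laden illustration, not how Pipelines computes its metrics: the trajectory record shape (`tool`, `args`, `provenance`), the schema format (argument name mapped to a Python type), and the functions `provenance_counts` and `schema_compliance` are all hypothetical.

```python
from collections import Counter


def provenance_counts(trajectory):
    """Count tool calls per provenance label (simulated, injected, passthrough, error)."""
    return Counter(call["provenance"] for call in trajectory)


def schema_compliance(trajectory, schemas):
    """Fraction of calls whose arguments exactly match the declared names and types."""
    if not trajectory:
        return 1.0
    compliant = 0
    for call in trajectory:
        expected = schemas.get(call["tool"])
        args = call.get("args", {})
        if (
            expected is not None
            and args.keys() == expected.keys()
            and all(isinstance(args[name], kind) for name, kind in expected.items())
        ):
            compliant += 1
    return compliant / len(trajectory)


# Hypothetical tool schemas and a three-call trajectory.
schemas = {
    "get_status": {"order_id": str},
    "refund": {"order_id": str, "amount": float},
}
trajectory = [
    {"tool": "get_status", "args": {"order_id": "1042"}, "provenance": "injected"},
    {"tool": "get_status", "args": {"order_id": "1042"}, "provenance": "simulated"},
    # "amount" is a string here, so this call violates the declared schema.
    {"tool": "refund", "args": {"order_id": "1042", "amount": "19.99"}, "provenance": "simulated"},
]
counts = provenance_counts(trajectory)
compliance = schema_compliance(trajectory, schemas)
```

Metrics like these are mechanical in the sense that they need no model judgment; a semantic verdict from a judge model would sit alongside them rather than replace them.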

Evaluate sessions and automate workflows

Multi-turn testing: Run model-as-user sessions to evaluate coherence, memory, and cross-turn task completion.
Datasets and comparisons: Store runs in versioned datasets and compare behavior across agent versions or scenario revisions.
API, SDK, and CLI: Automate agent registration, task seeding, run dispatch, and result analysis programmatically.
RBAC: Control access with Org Admin, Project Admin, and Contributor roles.
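
As a rough picture of what comparing behavior across agent versions might involve once results are exported, the Python sketch below computes per-version pass rates and lists scenarios that regressed. The run record fields (`agent_version`, `scenario`, `passed`) and the functions `pass_rates` and `regressions` are hypothetical; the actual Pipelines API, SDK, and CLI result formats are not shown in this document.

```python
from collections import defaultdict


def pass_rates(runs):
    """Map each agent version to the fraction of its runs that passed."""
    totals = defaultdict(lambda: [0, 0])  # version -> [passed count, total count]
    for run in runs:
        bucket = totals[run["agent_version"]]
        bucket[0] += int(run["passed"])
        bucket[1] += 1
    return {version: passed / total for version, (passed, total) in totals.items()}


def regressions(runs, baseline, candidate):
    """Scenarios that passed under the baseline version but failed under the candidate."""
    passed_before = {
        r["scenario"] for r in runs if r["agent_version"] == baseline and r["passed"]
    }
    failed_after = {
        r["scenario"] for r in runs if r["agent_version"] == candidate and not r["passed"]
    }
    return sorted(passed_before & failed_after)


# Hypothetical run results for two versions of the same agent.
runs = [
    {"agent_version": 1, "scenario": "refund-happy-path", "passed": True},
    {"agent_version": 1, "scenario": "refund-api-timeout", "passed": True},
    {"agent_version": 2, "scenario": "refund-happy-path", "passed": True},
    {"agent_version": 2, "scenario": "refund-api-timeout", "passed": False},
]
rates = pass_rates(runs)
regressed = regressions(runs, baseline=1, candidate=2)
```

Listing regressed scenarios by name, rather than only comparing aggregate pass rates, points directly at the runs worth inspecting.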
