Pipelines Docs is in beta — content is actively being added.

Agent Tracing with Pipelines

Grading Continuity

Deterministic input/output regression tests can't capture how an agent reasons, adapts, and recovers. Pipelines grades the behavior behind the outcome:

  • Task Completion: Given a goal and a set of tools, did the agent actually get the job done? Determine Accuracy.
  • Tool Use: With APIs and live MCP connections on hand, did it reach for the right tool at the right moment — and hold onto what mattered? Understand Trajectory Quality.
  • Failure Handling: A dependency misfires; an API stalls. When the environment turns unpredictable, does the agent recover gracefully? Measure Resilience.
  • Complex System Behavior: Dispatch a fleet of agents together. Do they carry context forward, coordinate, and reach the shared objective? Observe System-level Behavior.

Key capabilities

Register and version agents

  • Agent Library: Register external HTTP agents or in-platform code/sandbox agents, declare tool schemas, and configure per-run timeout and concurrency.
  • Agent versioning: Track material config/tool updates as new versions so run history stays reproducible.
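
As a rough illustration of the versioning idea above, the sketch below bumps a version number only when an agent's config or tool schemas actually change. It is not the Pipelines API: `AgentRegistry`, `register`, the config keys, and the example endpoint are all hypothetical names chosen for this example, and fingerprinting the config with a hash is just one assumed way to detect a material update.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class AgentRegistry:
    """Hypothetical in-memory registry; an illustration, not the Pipelines API."""

    # agent name -> list of (version, fingerprint, config)
    versions: dict = field(default_factory=dict)

    @staticmethod
    def fingerprint(config: dict) -> str:
        # Canonical JSON so that key order does not affect the hash.
        canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def register(self, name: str, config: dict) -> int:
        """Return the version for this config, creating a new one if it changed."""
        history = self.versions.setdefault(name, [])
        fp = self.fingerprint(config)
        if history and history[-1][1] == fp:
            return history[-1][0]  # unchanged config: reuse the latest version
        version = len(history) + 1
        history.append((version, fp, config))
        return version


# Assumed config shape: endpoint, per-run timeout, concurrency, tool schemas.
base = {
    "endpoint": "https://agent.example.com/run",
    "timeout_s": 60,
    "max_concurrency": 4,
    "tools": [{"name": "lookup_order", "parameters": {"order_id": "string"}}],
}

registry = AgentRegistry()
v1 = registry.register("support-agent", base)
v1_again = registry.register("support-agent", dict(base))  # same content
v2 = registry.register("support-agent", {**base, "timeout_s": 120})
```

Because every run can record the version it used, an older result can be traced back to the exact config and tool schemas that produced it.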

Simulate and stress test behavior

  • Odyssey execution modes: Run tools in sandbox simulation, in passthrough mode against live services, or in failure-injected mode to test recovery paths.
  • Task seeding: Author scenarios by specifying the task instruction, behavior instructions, initial state, failure rules, expected outcome, and tracked constraints.
  • Synthetic generation grounded in real data: Generate synthetic seed tasks in bulk, then ground generation with dataset-backed synthetic profiles so scenarios reflect real operating patterns.
  • Model controls: Select simulator and judge models independently, with org-level defaults and BYOK model credentials.
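
To make the execution modes and task seed fields above more concrete, here is a minimal sketch of how a tool call might be routed per mode. Everything in it is an assumption for illustration: `TaskSeed`, `Mode`, `call_tool`, the shape of `failure_rules` (tool name mapped to an error message), and the fallback to simulation when no failure rule matches are not taken from the Pipelines SDK. The provenance labels mirror the ones listed under trajectory visibility.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Dict, List


class Mode(Enum):
    SIMULATED = "simulated"
    PASSTHROUGH = "passthrough"
    INJECTED = "injected"


@dataclass
class TaskSeed:
    """Hypothetical container for the scenario fields named above."""
    instruction: str
    behavior_instructions: str
    initial_state: Dict[str, Any]
    failure_rules: Dict[str, str]  # assumed shape: tool name -> error to inject
    expected_outcome: Dict[str, Any]
    tracked_constraints: List[str] = field(default_factory=list)


@dataclass
class ToolResult:
    tool: str
    response: Any
    provenance: str  # "simulated", "injected", "passthrough", or "error"


def call_tool(
    tool: str,
    args: Dict[str, Any],
    mode: Mode,
    seed: TaskSeed,
    live_tools: Dict[str, Callable[..., Any]],
    simulator: Callable[[str, Dict[str, Any]], Any],
) -> ToolResult:
    # Failure-injected mode: return the scripted error if a rule targets this tool.
    if mode is Mode.INJECTED and tool in seed.failure_rules:
        return ToolResult(tool, {"error": seed.failure_rules[tool]}, "injected")
    # Passthrough mode: call the live implementation and record real failures.
    if mode is Mode.PASSTHROUGH:
        try:
            return ToolResult(tool, live_tools[tool](**args), "passthrough")
        except Exception as exc:
            return ToolResult(tool, {"error": str(exc)}, "error")
    # Otherwise (sandbox mode, or injected mode with no matching rule): simulate.
    return ToolResult(tool, simulator(tool, args), "simulated")


seed = TaskSeed(
    instruction="Refund order 1001",
    behavior_instructions="Confirm the order exists before refunding.",
    initial_state={"orders": {"1001": "paid"}},
    failure_rules={"issue_refund": "503 Service Unavailable"},
    expected_outcome={"orders": {"1001": "refunded"}},
    tracked_constraints=["never refund twice"],
)


def fake_simulator(tool: str, args: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for a model-backed simulator.
    return {"ok": True, "tool": tool}


failed = call_tool("issue_refund", {"order_id": "1001"}, Mode.INJECTED, seed, {}, fake_simulator)
looked_up = call_tool("lookup_order", {"order_id": "1001"}, Mode.INJECTED, seed, {}, fake_simulator)
```

Tagging each result with its provenance at call time is what lets a later review distinguish an agent that mishandled a scripted failure from one that hit a genuine error.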

Inspect runs with evidence

  • Trajectory visibility: Inspect tool-call timelines, arguments, responses, and per-call provenance (simulated, injected, passthrough, or error).
  • Judge + mechanical metrics: Review semantic pass/fail verdicts alongside structural metrics such as completion, schema compliance, and consistency.
  • State inspection: Analyze ledger/state transitions and final run artifacts to understand why a run passed or failed.
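
The structural metrics mentioned above can be thought of as simple checks over a recorded trajectory. The sketch below is one assumed way to compute them; the record layout (`tool`, `args`, `provenance`), the type-based schema format, and the function names are hypothetical and do not describe how Pipelines computes its metrics.

```python
from collections import Counter
from typing import Any, Dict, List


def schema_compliance(calls: List[Dict[str, Any]], schemas: Dict[str, Dict[str, type]]) -> float:
    """Fraction of calls whose argument names and types match the declared schema."""
    if not calls:
        return 1.0
    compliant = 0
    for call in calls:
        schema = schemas.get(call["tool"])
        if schema is None:
            continue  # a call to an undeclared tool counts as non-compliant
        args = call["args"]
        if set(args) == set(schema) and all(isinstance(args[k], t) for k, t in schema.items()):
            compliant += 1
    return compliant / len(calls)


def completed(final_state: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """True when every expected key holds the expected value in the final state."""
    return all(final_state.get(key) == value for key, value in expected.items())


def provenance_counts(calls: List[Dict[str, Any]]) -> Counter:
    """Count how many calls were simulated, injected, passthrough, or error."""
    return Counter(call["provenance"] for call in calls)


# Example trajectory: the third call passes the amount as a string by mistake.
schemas = {
    "lookup_order": {"order_id": str},
    "issue_refund": {"order_id": str, "amount": float},
}
calls = [
    {"tool": "lookup_order", "args": {"order_id": "1001"}, "provenance": "simulated"},
    {"tool": "issue_refund", "args": {"order_id": "1001", "amount": 25.0}, "provenance": "injected"},
    {"tool": "issue_refund", "args": {"order_id": "1001", "amount": "25"}, "provenance": "simulated"},
    {"tool": "issue_refund", "args": {"order_id": "1001", "amount": 25.0}, "provenance": "simulated"},
]

compliance = schema_compliance(calls, schemas)  # 3 of 4 calls match
done = completed({"order_1001": "refunded", "refund_count": 1}, {"order_1001": "refunded"})
by_source = provenance_counts(calls)
```

Checks like these are deterministic, so they complement a judge model's semantic verdict rather than replace it.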

Evaluate sessions and automate workflows

  • Multi-turn testing: Run model-as-user sessions to evaluate coherence, memory, and cross-turn task completion.
  • Datasets and comparisons: Store runs in versioned datasets and compare behavior across agent versions or scenario revisions.
  • API, SDK, and CLI: Automate agent registration, task seeding, run dispatch, and result analysis programmatically.
  • RBAC: Control access with Org Admin, Project Admin, and Contributor roles.
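
For the comparison workflow described above, a script that consumes exported results might look like the following sketch. It assumes results have already been fetched as plain pass/fail lists keyed by scenario id; the data layout and the `compare_versions` and `regressions` helpers are hypothetical and are not part of the Pipelines API, SDK, or CLI.

```python
from typing import Dict, List


def compare_versions(
    baseline: Dict[str, List[bool]],
    candidate: Dict[str, List[bool]],
) -> Dict[str, Dict[str, float]]:
    """Per-scenario pass rates for two agent versions, plus the change between them."""
    report = {}
    for scenario in sorted(set(baseline) & set(candidate)):
        base_runs, cand_runs = baseline[scenario], candidate[scenario]
        if not base_runs or not cand_runs:
            continue  # skip scenarios with no runs on either side
        base_rate = sum(base_runs) / len(base_runs)
        cand_rate = sum(cand_runs) / len(cand_runs)
        report[scenario] = {
            "baseline": base_rate,
            "candidate": cand_rate,
            "delta": cand_rate - base_rate,
        }
    return report


def regressions(report: Dict[str, Dict[str, float]], tolerance: float = 0.0) -> List[str]:
    """Scenarios where the candidate's pass rate dropped by more than the tolerance."""
    return [name for name, row in report.items() if row["delta"] < -tolerance]


# Made-up verdicts for two versions of the same agent.
v1_results = {
    "refund-happy-path": [True, True, False, True],
    "refund-api-timeout": [True, False],
}
v2_results = {
    "refund-happy-path": [True, False, False, True],
    "refund-api-timeout": [True, True],
}

report = compare_versions(v1_results, v2_results)
regressed = regressions(report)
```

A check like this could run in CI after each new agent version so that a drop in any scenario's pass rate is flagged before release.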