Runbooks

All runbooks assume:

Role. Registration and the Agents pages are visible only to Org Admins and Project Admin Owners.
A judge model. Every run is graded by an LLM judge. Pick one per agent field (Models popover in the Pipeline Builder) or set an org default under Settings → Models once and forget it. A run with neither fails as agent_model_unresolved.
Publish. A draft agent can't be dispatched. After saving, click Publish.

Any sandbox agent

1. Sandbox a custom Python agent

Goal: run your own Python code in a platform sandbox, end to end, with no repo and no tools: register, seed one task, and read the trajectory and verdict.

Step 1: Register. In the sidebar, Agents → Register agent, pick the Sandbox Agents mode card. Set a Name (e.g. hello-sandbox). Under How your agent runs, select Python function.

Step 2: Paste the code. In the Code source picker choose Paste code and paste the following module:

import platform
import subprocess
from pathlib import Path


def run(task_input, *, proxy_url, run_token):
    instruction = task_input["user_instruction"]

    # Do some real work so the trajectory has steps to show.
    Path("/tmp/notes.txt").write_text(f"task: {instruction}\n")
    uname = subprocess.run(["uname", "-a"], capture_output=True, text=True)

    return {
        "final_response": (
            f"Hello from the sandbox (Python {platform.python_version()}). "
            f"You asked: {instruction!r}. Kernel: {uname.stdout.strip()}"
        )
    }

The contract: a top-level callable taking (task_input, *, proxy_url, run_token) and returning {"final_response": ...} (or a plain string, which the platform wraps). An unhandled exception is captured and graded; it doesn't count as an infra failure.
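To smoke-test an entrypoint before pasting it, the contract can be exercised locally. This is a minimal sketch, assuming the platform's wrapping of a plain-string return is equivalent to {"final_response": <string>}. call_entrypoint and the placeholder proxy_url / run_token values are hypothetical local stand-ins, not platform APIs.

```python
def call_entrypoint(fn, instruction):
    """Call an agent entrypoint the way the contract describes (local stand-in only)."""
    result = fn(
        {"user_instruction": instruction},
        proxy_url="http://localhost:0",  # placeholder; no proxy exists locally
        run_token="local-test-token",    # placeholder; not a real token
    )
    # Assumption: a plain string is wrapped as {"final_response": <string>}.
    if isinstance(result, str):
        return {"final_response": result}
    if not isinstance(result, dict) or "final_response" not in result:
        raise TypeError("entrypoint must return a str or a dict with 'final_response'")
    return result


def echo(task_input, *, proxy_url, run_token):
    """Tiny example agent that returns a plain string."""
    return f"You asked: {task_input['user_instruction']}"


print(call_entrypoint(echo, "ping"))
```

Catching a wrong return shape locally is cheaper than discovering it in a graded run.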

The sandbox cannot import the Pipelines SDK, so write SDK-free code. A real agent calls your model provider directly: declare the package (e.g. anthropic) under Sandbox environment (advanced) → Python requirements and supply the API key as an Environment variable → From credential (runbook 6 covers the environment; the credential path is the only encrypted one).

Step 3: Entrypoint. Leave Entrypoint as run (it must name a top-level callable; dotted paths are rejected). Save, then Publish.

Step 4: Wire and seed. In the Pipeline Builder, add an agent-mode field and select hello-sandbox. In the field's Models popover pick a Judge model (skip if your org default is set). The seed's user column is always required. Create one task with this CSV:

user
"Summarize what environment you are running in."

Verify: open the task from the Data Explorer (View trace). The run completes, and the Agent Trace tab shows the final response, a Judge verdict (PASS/FAIL with rubric reasoning), and a Trajectory that fills in after the run completes (for this example, a file write and a shell step). No diff and no scorer badges appear: there's no workspace by design.

If it fails

Error	Cause	Fix
FAILED `agent_model_unresolved`	No judge model is resolvable for the run.	Pick one on the field's Models popover or set the org default.
Trajectory is empty	Capture is best-effort and keys off real syscalls like `subprocess`, `open`, and `httpx`; pure compute returns can produce few steps.	Trigger at least one real filesystem, process, or network action in the agent path you are testing.
You expected a diff	Plain sandbox runs have no workspace attached.	Attach a coding scenario (runbook 7).

2. Ship agent code as a ZIP

Goal: Run a multi-file agent when inline paste limits are exceeded (200 KB for pasted code, 1 MB across Multiple files) or when non-.py assets are required.

You'll need: A .zip archive of your agent (≤ 100 MB compressed, ≤ 500 MB uncompressed) and Org Admin permissions in the agent's organization. Archive uploads are organization-scoped and are not granted by project-level roles.

Step 1: Structure the archive. A single top-level folder is flattened on unzip, so both of these land the same way in /home/user/agent:

my-agent.zip                 my-agent.zip
├── main.py                  └── my-agent/
└── helpers/                     ├── main.py
    └── prompts.py               └── helpers/
                                     └── prompts.py

main.py defines the entrypoint, exactly as in runbook 1:

from helpers.prompts import SYSTEM_PROMPT


def run(task_input, *, proxy_url, run_token):
    return {"final_response": f"{SYSTEM_PROMPT}: {task_input['user_instruction']}"}

Step 2: Upload. Register agent → Sandbox Agents → Python function, then in the Code source picker choose Upload ZIP and drop the file. The form uploads it to storage with a short-lived signed URL and blocks submit until the upload confirms. Only the confirmed archive id is saved on the agent, never the bytes.

Step 3: Entrypoint. Set Entrypoint to run and Entrypoint file to main.py (the .py inside the zip that defines it). Save and Publish.

The zip's contents aren't known at save time, so Entrypoint file is only shape-checked on save and verified inside the sandbox at dispatch: a wrong path fails the run, not the save.
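Because a wrong Entrypoint file only surfaces at dispatch, it can help to check the path locally first. The sketch below is an assumption about how flattening behaves (a folder is stripped only when every file sits under the same single top-level folder); effective_paths and build_zip are hypothetical helpers, not platform functions.

```python
import io
import zipfile


def effective_paths(zip_bytes):
    """Return file paths as they would land after single-top-level-folder flattening."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        files = [n for n in zf.namelist() if not n.endswith("/")]
    tops = {n.split("/", 1)[0] for n in files}
    # Flatten only when every file sits under the same single top-level folder.
    if len(tops) == 1 and all("/" in n for n in files):
        return sorted(n.split("/", 1)[1] for n in files)
    return sorted(files)


def build_zip(names):
    """Build a small in-memory zip with empty files, for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "")
    return buf.getvalue()


flat = build_zip(["main.py", "helpers/prompts.py"])
nested = build_zip(["my-agent/main.py", "my-agent/helpers/prompts.py"])

# Both layouts from Step 1 resolve to the same paths, so Entrypoint file is main.py in each.
print(effective_paths(flat))
print(effective_paths(nested))
```

If the Entrypoint file you plan to enter is not in the returned list, fix the path before saving.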

Step 4: Seed and dispatch exactly as in runbook 1, step 4.

Verify: at dispatch the archive is validated (size caps, zip-slip and symlink guards) and unzipped into the agent directory /home/user/agent, never into the graded workspace. The run produces a trajectory just like a pasted-code run. Fetch and validation happen before a sandbox boots, so a bad archive costs nothing.
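The same classes of check can be approximated locally before uploading. This is a sketch under stated assumptions, not the platform's validator: the caps are the figures from You'll need, interpreted as binary megabytes, and preflight is a hypothetical helper name.

```python
import io
import stat
import zipfile

MAX_COMPRESSED = 100 * 1024 * 1024    # 100 MB cap (binary MB is an assumption)
MAX_UNCOMPRESSED = 500 * 1024 * 1024  # 500 MB cap (binary MB is an assumption)


def preflight(zip_bytes):
    """Return a list of problems found; an empty list means the archive looks fine."""
    problems = []
    if len(zip_bytes) > MAX_COMPRESSED:
        problems.append("archive exceeds compressed size cap")
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        infos = zf.infolist()
    if sum(info.file_size for info in infos) > MAX_UNCOMPRESSED:
        problems.append("archive exceeds uncompressed size cap")
    for info in infos:
        parts = info.filename.replace("\\", "/").split("/")
        # Zip-slip: absolute paths or ".." components could escape the target directory.
        if info.filename.startswith("/") or ".." in parts:
            problems.append(f"unsafe path: {info.filename}")
        # Symlink: the Unix file mode is stored in the high 16 bits of external_attr.
        if stat.S_ISLNK(info.external_attr >> 16):
            problems.append(f"symlink entry: {info.filename}")
    return problems


# Build a deliberately bad archive to show what gets flagged.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("main.py", "def run(task_input, *, proxy_url, run_token): return 'ok'\n")
    zf.writestr("../escape.py", "")
    link = zipfile.ZipInfo("link")
    link.external_attr = (stat.S_IFLNK | 0o777) << 16
    zf.writestr(link, "/etc/passwd")
bad = buf.getvalue()
print(preflight(bad))
```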

If it fails

Error	Cause	Fix
FAILED `agent_code_fetch_failed` before any sandbox	The file is unconfirmed, cross-org, or failed archive validation (size, zip-slip, or symlink checks).	Re-upload the archive into the agent's organization and wait for upload confirmation.
Run fails on the entrypoint probe	The configured Entrypoint file path does not match a valid path inside the zip.	Correct the path, accounting for top-level folder flattening, then run again.
Register stays disabled	The upload is still in progress.	Wait for upload confirmation and avoid reloading during upload.

3. Pull agent code from a private git repo

Goal: Clone agent source code from a private repository at dispatch time, while ensuring the token never appears in output, logs, or stored configuration.

You'll need: An https clone URL and a PAT, or an organization credential that stores the PAT.

Step 1: Point at the repo. Register agent → Sandbox Agents → Python function, Code source picker → Git repository. Enter the Repository URL (https only; SSH URLs and user:pass@ userinfo are rejected, and an SSRF guard blocks private/loopback hosts) and a Ref (optional): a branch, tag, or commit SHA, e.g. v1.2.0.
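The URL rules can be approximated with a local check before saving. This sketch is illustrative only: check_repo_url is a hypothetical helper, it inspects literal IP addresses and the name localhost, and it does not resolve DNS the way a real SSRF guard would need to.

```python
import ipaddress
from urllib.parse import urlsplit


def check_repo_url(url):
    """Return None if the URL passes these local checks, else a reason string."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return "only https clone URLs are accepted"
    if parts.username is not None or parts.password is not None:
        return "embedded user:pass@ userinfo is rejected"
    host = parts.hostname
    if not host:
        return "missing host"
    if host == "localhost":
        return "loopback host"
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A hostname rather than an IP literal; a real guard would also
        # check the addresses it resolves to.
        return None
    if ip.is_private or ip.is_loopback or ip.is_link_local:
        return "private or loopback address"
    return None


for candidate in (
    "https://github.com/org/repo.git",
    "git@github.com:org/repo.git",
    "https://user:pass@github.com/org/repo.git",
    "https://127.0.0.1/repo.git",
):
    print(candidate, "->", check_repo_url(candidate) or "ok")
```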

Step 2: Auth. Pick one of the three Auth modes:

None: public repo.
From credential: an existing org credential, decrypted server-side at dispatch.
Inline token: paste the PAT once. It's write-only: stored in a hidden platform-managed credential, never echoed back (edit mode shows rotate-or-keep, never the token).

Step 3: Entrypoint. Set Entrypoint (run) and Entrypoint file (the module path inside the repo, default main.py). Save, Publish, then seed and dispatch as in runbook 1, step 4.

Verify: the PAT is decrypted only when the clone command is built, injected into the clone URL, and the remote is dropped right after the clone. It's registered as a run secret, so it renders as *** on every trajectory and final-response surface, even if your agent echoes it. The clone resolves before a sandbox boots; on later turns of a multi-turn session the populated sandbox is reused without re-cloning.
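To make the masking behavior concrete, here is a minimal sketch of secret redaction by plain substring replacement. It is not the platform's implementation; redact and the token value are hypothetical, and a production redactor would likely also handle encoded forms of the secret.

```python
def redact(text, secrets):
    """Replace every occurrence of each registered secret with ***."""
    # Longest first, so a secret that contains another is masked whole.
    for secret in sorted(secrets, key=len, reverse=True):
        if secret:  # never replace the empty string
            text = text.replace(secret, "***")
    return text


token = "ghp_exampleNotARealToken"  # hypothetical value for illustration
print(redact(f"cloning with {token} ...", {token}))
```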

If it fails

Error	Cause	Fix
FAILED `agent_code_fetch_failed`	Agent code checkout is strict, so a missing branch, tag, or SHA is treated as a hard error. Clone failures can also come from an invalid URL or PAT.	Correct the Ref value, or fix the repository URL or PAT if clone authentication failed.
FAILED `agent_secret_unresolved`	The git credential is missing or cannot be decrypted.	Re-bind From credential or re-paste the Inline token.
Save rejected	The configuration uses a non-https URL, includes embedded `user:pass@`, or sets both a credential and an inline token.	Use https only and configure exactly one authentication method.

4. Give a sandboxed agent platform tools

Goal: Enable a sandbox agent to call platform tools, either simulated against seeded world state or passed through to a registered endpoint, with each call recorded as a ledger row.

You'll need: A registered sandbox agent (runbooks 1–3) or a coding CLI agent (runbook 7).

Step 1: Declare the tools. On the agent form's Tools step, add each tool's name and input_schema and pick its execution mode: sandbox (the platform simulates the response from world state) or passthrough (forwarded verbatim to a bound endpoint). Or import in bulk: Import JSON for a raw array, Import from MCP to discover a connected server's catalog. A minimal sandbox-mode tool:

{
  "name": "get_order",
  "description": "Look up an order by id.",
  "input_schema": {
    "type": "object",
    "properties": { "order_id": { "type": "string" } },
    "required": ["order_id"]
  }
}

Field rules and passthrough bindings: Tools schema.

Step 2: Call them from a Python-function agent. No SDK, no MCP: just raw HTTP against the per-run proxy. Declare httpx under Python requirements and paste:

import os

import httpx


def call_tool(name, args):
    r = httpx.post(
        f"{os.environ['PIPELINES_ODYSSEY_PROXY_URL']}/tools/{name}",
        headers={"Authorization": f"Bearer {os.environ['PIPELINES_RUN_TOKEN']}"},
        json=args,
        timeout=60,
    )
    r.raise_for_status()
    return r.json()


def run(task_input, *, proxy_url, run_token):
    order = call_tool("get_order", {"order_id": "4521"})
    return {"final_response": f"Order status: {order}"}

(proxy_url / run_token arrive both as kwargs and as the PIPELINES_* env vars; use either.)

Step 2′: Or let a CLI harness reach them. A shell-command agent (runbook 7) gets the same tools through the pipelines MCP shim, which is auto-registered into the harness when tools are attached, so there is nothing to wire. Claude Code, Codex, and Cursor are supported; details in MCP tools and Harness customization.

Verify: run a task and open its trace. Each tool call is a ledger row with the arguments, the response, and a source badge (simulated, passthrough, injected when a failure rule fired, or error), plus a step-by-step world-state diff.

If it fails

Error	Cause	Fix
`tool_not_executable` on a coding run	Coding runs do not silently simulate tool execution.	Bind the tool to a real passthrough endpoint, or have the agent perform the work directly in the repository.
`503 lock_contention` on parallel calls	Tool calls are serialized per run and concurrent calls contend on the same lock.	Retry the call after contention clears.
Tools dropped with a warning at dispatch	The run uses Aider (no MCP support) or an unrecognized run command.	Use Claude Code, Codex, or Cursor, or call the proxy directly from your own code as in Step 2.
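For the 503 lock_contention row, the retry can be sketched as a small backoff loop. This is an illustration under assumptions: call_with_retry is a hypothetical helper, the attempt count and delays are arbitrary, and send stands for any zero-argument callable that performs the HTTP call and returns (status_code, body) without raising on 503.

```python
import random
import time


def call_with_retry(send, attempts=5, base_delay=0.2, sleep=time.sleep):
    """Call send() until it returns a non-503 status or attempts run out."""
    for attempt in range(attempts):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < attempts - 1:
            # Exponential backoff with jitter so parallel callers spread out.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return status, body


# Demonstration with a fake sender: two contended calls, then success.
responses = iter([(503, "lock_contention"), (503, "lock_contention"), (200, {"status": "shipped"})])
print(call_with_retry(lambda: next(responses), sleep=lambda seconds: None))
```

With the call_tool helper from Step 2, the status would need to be inspected before raise_for_status() so that a 503 reaches the retry loop instead of raising.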

5. Run a multi-turn conversation

Goal: Run a simulated user through multiple turns against the agent within one persistent session, and obtain a session-level verdict.

You'll need: A registered agent with healthy one-shot runs (runbook 1).

Step 1: Configure the simulator. On the agent field, open the Multi-turn popover and set:

simulator_mode: persona (an LLM plays the user; give it a short persona describing goals, tone, escalation) or scripted (a fixed JSON array of user turns, replayed in order).
max_turns: hard cap, 1..50 (default 10; 6–10 is a good start).
memory_mode: replay (default; the platform re-sends the transcript in input.messages each turn) or stateful (your agent keeps its own memory, keyed by input.session_id).

The popover sets turn_mode = model_as_user on the seed for you. To vary any of these per row instead, opt each column in under the popover's Vary per row from CSV section. A wired column becomes a required CSV column at upload:

user,turn_mode,max_turns,simulator_mode,memory_mode,user_simulator_persona
"Help me fix my failed refund for order #4521.","model_as_user","8","persona","replay","Frustrated but cooperative; expects the agent to remember order details and not repeat verification."

Step 2: Dispatch and read. Run the task and open it: multi-turn rows show the canonical transcript across turns, a per-turn trajectory, and the session-level judge verdict over the whole conversation.

Verify: the session terminates when the simulator signals done, max_turns is hit, or a termination_keyword appears in an agent reply. On a coding session (runbook 7 + this one) the sandbox and repo are seeded once at turn 0 and carried forward; each turn's diff is cumulative against the original baseline, and the session judge sees the cumulative diff alongside the transcript. Full axis reference: Multi-turn testing.

If it fails

Error	Cause	Fix
Session FAILED `simulator_model_unresolved`	No user-simulator model is configured for the run.	Pick one on the field's Models popover or set the org default.
Ran single-shot despite the popover	A `turn_mode` typo (`multi_turn`, `model_as_users`) or a stale seed causes fallback to one-shot mode.	Use the exact value `model_as_user`, then re-check column mapping and re-seed.
Session ends immediately	`simulator_mode = scripted` is set with an empty or non-JSON-array `scripted_user_turns` value.	Provide `scripted_user_turns` as a JSON array of strings.
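Most of the failures in this table come from malformed cell values, so the per-row CSV from Step 1 can be generated and checked before upload. A minimal sketch: make_row, to_csv, and parse_scripted_turns are hypothetical helpers, and the allowed values are the ones listed in Step 1.

```python
import csv
import io
import json

COLUMNS = ["user", "turn_mode", "max_turns", "simulator_mode",
           "memory_mode", "user_simulator_persona"]


def make_row(user, persona, max_turns=10, simulator_mode="persona", memory_mode="replay"):
    """Build one seed row, rejecting values outside the documented ranges."""
    if not 1 <= max_turns <= 50:
        raise ValueError("max_turns must be in 1..50")
    if simulator_mode not in ("persona", "scripted"):
        raise ValueError("simulator_mode must be 'persona' or 'scripted'")
    if memory_mode not in ("replay", "stateful"):
        raise ValueError("memory_mode must be 'replay' or 'stateful'")
    return {
        "user": user,
        "turn_mode": "model_as_user",  # exact value; typos fall back to one-shot
        "max_turns": str(max_turns),
        "simulator_mode": simulator_mode,
        "memory_mode": memory_mode,
        "user_simulator_persona": persona,
    }


def to_csv(rows):
    """Serialize rows with the header in the wired column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def parse_scripted_turns(cell):
    """Return the turns, or raise ValueError unless cell is a non-empty JSON array of strings."""
    try:
        turns = json.loads(cell)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from None
    if not isinstance(turns, list) or not turns:
        raise ValueError("expected a non-empty JSON array")
    if not all(isinstance(turn, str) for turn in turns):
        raise ValueError("every turn must be a string")
    return turns


row = make_row("Help me fix my failed refund for order #4521.",
               "Frustrated but cooperative.", max_turns=8)
print(to_csv([row]))
print(parse_scripted_turns('["Where is my refund?", "It was order #4521."]'))
```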

6. Customize the sandbox runtime

Goal: add system packages, Python deps, a boot step, or a persistent custom image to any code agent. The default image is Python 3.13 with git, ripgrep, unzip, uv, and pytest; start here only if that's not enough.

You'll need: a registered sandbox agent (any of runbooks 1–3, 7).

Step 1: Boot-time layering (per-run, no image build). Open Sandbox environment (advanced) and set any of:

System packages (one per line): apt packages (≤ 50), installed as root once at sandbox boot, before your agent.
Setup command: one shell command run at sandbox start, after package installs, with your resolved env (so it can use a credential-backed token). A nonzero exit fails the run.
Python requirements (one per line) and Python version (3.9–3.13; blank = 3.13): Python-function agents only; pip-installed before your agent runs.

Step 2: Persistent image (for heavier tooling). Switch Base image to Custom Dockerfile and write only the body; the platform prepends FROM pipelines-workspace-base:

RUN sudo apt-get update && sudo apt-get install -y jq
RUN sudo npm install -g @anthropic-ai/claude-code
ENV NODE_OPTIONS=--max-old-space-size=4096

Constraints: only RUN / ENV / WORKDIR (no COPY/ADD, since there's no build context, and no second FROM, ENTRYPOINT, CMD, or USER); ≤ 32 KB of text. Two gotchas that account for most build failures:

The build runs as a normal user: global installs (npm install -g, system pip, apt-get) need sudo (passwordless in the base image).
The base is Python 3.13, so Python CLIs pinning old deps won't build there. Use uv tool install --python 3.11 <tool> and call the binary by full path (/home/user/.local/bin/<tool>).
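The instruction and size constraints can be linted locally before triggering a build. A minimal sketch, assuming 32 KB means 32 × 1024 bytes of UTF-8 text; lint_dockerfile_body is a hypothetical helper and does not reproduce the platform's own validation.

```python
ALLOWED = {"RUN", "ENV", "WORKDIR"}
MAX_BYTES = 32 * 1024  # assumption: 32 KB counted as UTF-8 bytes


def lint_dockerfile_body(body):
    """Return a list of problems for a Custom Dockerfile body (local pre-check only)."""
    problems = []
    if len(body.encode("utf-8")) > MAX_BYTES:
        problems.append("body exceeds 32 KB")
    continued = False
    for number, line in enumerate(body.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if not continued:
            # The first word of a new instruction must be on the allow-list.
            instruction = stripped.split(None, 1)[0].upper()
            if instruction not in ALLOWED:
                problems.append(f"line {number}: {instruction} is not allowed")
        # A trailing backslash continues the current instruction onto the next line.
        continued = stripped.endswith("\\")
    return problems


body = """RUN sudo apt-get update && \\
    sudo apt-get install -y jq
ENV NODE_OPTIONS=--max-old-space-size=4096
COPY . /app
"""
print(lint_dockerfile_body(body))
```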

Step 3: Build. Leave Build image now checked when saving (or click Build image on the agent detail page; building is always an explicit action). The Custom image card shows the chip: Not built → Building… (live log) → Ready (or Build failed with the log tail and a Rebuild button). An identical Dockerfile already built in your org is reused without rebuilding.

Verify: wait for Ready, then dispatch. Runs with a custom Dockerfile are pinned to the built image: while Building… or Build failed, dispatch is a hard error, never a silent fallback to the default image.

If it fails

Error	Cause	Fix
FAILED `environment_setup_failed`	A System packages install step or the Setup command exited nonzero.	Fix the failing command, remembering it runs as the default user and requires `sudo` for system-level changes.
Chip shows Build failed	The Dockerfile build failed, commonly due to missing `sudo` or a Python 3.13-incompatible package install.	Open the failure log on the agent page, correct the Dockerfile, then click Rebuild.
FAILED `image_not_ready` / `image_build_failed`	Dispatch was attempted before the image reached Ready status.	Build or rebuild the image first, then run again.
Build request returns 409	Only one build can run at a time per organization.	Wait for the in-flight build to complete, then retry.

Coding agents

7. Run a coding CLI agent on a repo task

Goal: register Claude Code (or Codex / Cursor / Aider) as an agent, point it at a seeded repo, and read the trajectory, final diff, scorer badges, and verdict.

You'll need: a model-provider API key stored as an org credential, and a repo for the agent to work on (any public repo works for a first run).

Step 1: Register. Agents → Register agent → Sandbox Agents. Under How your agent runs, pick Shell command (any CLI).

Step 2: Run command. Click the Claude Code preset chip. It fills in:

claude -p "$(cat $PIPELINES_TASK_FILE)"

The platform writes the task brief to $PIPELINES_TASK_FILE and, for a recognized CLI, appends the headless/approval flags for you (--print for Claude Code, --yes-always for Aider, --force for Cursor). Call the binary directly: wrapping the command in bash -c "…" is the one form that isn't recognized, and it costs you the rich trajectory.

Step 3: Install the CLI in the image. The base image ships no coding CLIs. Under Sandbox environment (advanced) → Base image → Custom Dockerfile, add the install line and leave Build image now checked:

CLI	Install line
Claude Code	`RUN sudo npm install -g @anthropic-ai/claude-code`
Codex	`RUN sudo npm install -g @openai/codex`
Cursor	`RUN curl https://cursor.com/install -fsSL | bash`
Aider	`RUN uv tool install --python 3.11 aider-chat`

Step 4: Key. Add an Environment variable row, mode From credential (the only encrypted path): ANTHROPIC_API_KEY for Claude Code, OPENAI_API_KEY for Codex, CURSOR_API_KEY for Cursor, the --model provider's key for Aider. Also raise the Run timeout: the 300 s default is tight for coding runs (max 1800 s). Save and Publish; wait for the Custom image chip to reach Ready.

Step 5: Attach a coding setup and seed. A shell-command agent always needs a workspace. In the Pipeline Builder, select the agent on an agent-mode field and open the field's Coding setup popover: pick the Git URL workspace tile and enter your repo's URL, add an optional Setup command (runs before the baseline commit, so it stays out of the graded diff), and Scorers rows if you want mechanical gates. Pick a Judge model, then seed one task:

user
"Fix the failing test in tests/test_parser.py and make the suite pass."

Verify: the platform seeds the repo at /home/user/workspace, commits a baseline, and runs your command. The task page then shows all four surfaces: a Trajectory timeline (Shell steps with output and exit codes, Edit steps as red/green diffs, Read/Search, and Assistant reasoning; it fills in after the run completes), the Final diff against the baseline, scorer badges, and the Judge verdict. Full UI tour: Inspecting runs.

If it fails

Error	Cause	Fix
FAILED `agent_command_failed` (in the error detail)	The command exited nonzero and left an empty diff, typically because the CLI waited on an interactive prompt.	Use a recognized binary so headless flags are appended automatically, or pass equivalent headless flags manually. (A nonzero exit with a real diff is not treated as failure; scorers and the judge determine outcome.)
FAILED `in_sandbox_requires_workspace`	No coding scenario is attached to the task.	Attach a coding scenario via the Coding setup popover (Step 5), then re-seed.
FAILED `image_not_ready`	The custom image is still building.	Wait for Ready status (Step 4), then run again.
Trajectory empty but diff present	An unrecognized run command (for example, `bash -c "…"`) triggered coarse fallback capture.	Invoke the CLI binary directly.
Banner "Eval phase failed, diff/scorers may be missing"	Diff or scorer capture failed, but the run itself completed and trajectory data is still available.	Expand the banner to inspect the redacted failure cause and re-run if needed.

8. Seed many coding tasks from CSV

Goal: define the repo, setup, and scorers once on the agent field, then fan a CSV of prompts out into many tasks, each frozen against that definition.

You'll need: a coding agent (runbook 7) and a CSV of task prompts.

Step 1: Define the coding setup. On the agent-mode field, open the Coding setup popover: pick the Workspace source (Git URL / Upload ZIP / Empty), an optional Setup, and Scorers rows. The definition saves onto the field, and every task seeded from the workflow starts from it.

The two Setup modes differ in whether the setup work is graded: Platform runs a command executes the Setup command before the baseline commit, so installs and fixtures never pollute the agent's diff; Agent sets up sends the Setup instructions to the agent, whose setup work does land in the graded diff. Use platform mode unless setting up is itself the task.

On the workspace seed path, a bad git Ref silently falls back to the repo's default branch: the run proceeds with no error (unlike the strict agent-code checkout of runbook 3). Double-check branch/tag/SHA spelling.

Step 2: Upload the CSV. The CSV carries the field's wired seed columns (user is always required), one prompt per row:

user
"Fix the failing test in tests/test_parser.py."
"Add a --json flag to the CLI entrypoint."
"Refactor parse() to return a dataclass without changing behavior."

Each row becomes one task carrying a deep-copied, frozen snapshot of the field's coding setup at creation time. Editing the setup later never changes already-seeded tasks; re-seed to pick up edits.
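The snapshot semantics can be illustrated with copy.deepcopy. This is a conceptual sketch only: seed_tasks and the dictionary keys are hypothetical and do not reflect the platform's actual task schema.

```python
import copy


def seed_tasks(prompts, coding_setup):
    """Create one task per prompt, each with its own deep copy of the setup."""
    return [{"user": prompt, "coding_setup": copy.deepcopy(coding_setup)}
            for prompt in prompts]


setup = {
    "workspace": {"git_url": "https://example.com/repo.git", "ref": "main"},
    "scorers": ["pytest"],
}
tasks = seed_tasks(["Fix the failing test.", "Add a --json flag."], setup)

# Editing the field's setup afterwards does not reach already-seeded tasks.
setup["workspace"]["ref"] = "develop"
setup["scorers"].append("lint")

print(tasks[0]["coding_setup"]["workspace"]["ref"])
print(tasks[1]["coding_setup"]["scorers"])
```

A shallow copy would not be enough here, because the nested workspace and scorers objects would still be shared with the field.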

Step 3: Per-row scenario control (API only). A saved, org-scoped scenario library lives at /api/coding-scenarios (names unique per org), and the seeding service accepts per-row CSV axes on top of it: scenario_ref picks a saved scenario by name per row, and workspace_seed / setup_command / eval override it per row (a row cell always wins; the scenario fills gaps). These columns flow only when the workflow's agent field wires them in agentConfig.odyssey_seed_columns via the workflow API. The builder UI doesn't expose toggles for them; its UI path is the Coding setup popover of Step 1.
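The precedence rule (a row cell wins; the scenario fills gaps) can be sketched as a small merge function. This is an illustration under assumptions, not the seeding service's code: resolve_axes is hypothetical, cell values are treated as opaque strings, and treating a blank setup_command as a gap is a guess, since only scenario_ref, workspace_seed, and eval are documented as hard-error axes.

```python
def resolve_axes(row, scenarios):
    """Merge a CSV row with its referenced scenario; a present row cell wins."""
    # Present-but-blank cells on these columns are hard errors; an omitted column is fine.
    for column in ("scenario_ref", "workspace_seed", "eval"):
        if column in row and not row[column].strip():
            raise ValueError(f"{column} is present but blank")
    scenario = {}
    if "scenario_ref" in row:
        try:
            scenario = scenarios[row["scenario_ref"]]
        except KeyError:
            raise LookupError(f"unknown scenario: {row['scenario_ref']}") from None
    resolved = {}
    for axis in ("workspace_seed", "setup_command", "eval"):
        cell = row.get(axis, "").strip()
        value = cell or scenario.get(axis)  # row cell first, scenario as fallback
        if value is not None:
            resolved[axis] = value
    return resolved


scenarios = {
    "parser-fix": {
        "workspace_seed": "https://example.com/repo.git",
        "setup_command": "pip install -e .",
        "eval": "pytest -q",
    }
}
row = {"user": "Fix the failing test.", "scenario_ref": "parser-fix",
       "setup_command": "make deps"}
print(resolve_axes(row, scenarios))
```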

Verify: one task per CSV row appears in the Data Explorer, each with the frozen workspace seed, setup, and scorers; open any row to confirm the seed snapshot on the trace tab. Full axis semantics: Coding scenarios.

If it fails

Error	Cause	Fix
A row fails with no task created (API path)	A present-but-blank or malformed `scenario_ref`, `workspace_seed`, or `eval` cell triggers a hard-error axis. Coding tasks are not silently downgraded to non-workspace runs.	Correct the malformed cell. Omitting the entire column is valid if that axis is not needed.
`ScenarioResolutionError` (API path)	The `scenario_ref` value is unknown, cross-org, or archived.	Use the exact name of a live scenario in the task's organization. Use `GET /api/coding-scenarios?include_archived=true` to locate archived entries.
`POST /api/coding-scenarios` returns 409	Scenario name collision occurred because names are unique per organization.	Rename the scenario and retry creation.
FAILED `workspace_seed_failed`	The repository could not be cloned or unzipped in the sandbox due to bad URL/host, disallowed scheme, invalid Subdirectory (optional), or archive validation failure.	Correct the workspace seed configuration and run again.
FAILED `workspace_setup_failed`	The platform-mode Setup command failed after repository seeding.	Reproduce the setup command locally against the same repository, fix the failure, then re-run.
