Runbooks
End-to-end walkthroughs of agent setup.
All runbooks assume:
- Role. Registration and the Agents pages are visible only to Org Admins and Project Admin Owners.
- A judge model. Every run is graded by an LLM judge. Pick one per agent
field (Models popover in the Pipeline Builder) or set an org default
under Settings → Models once and forget it. A run with neither fails as
agent_model_unresolved.
- Publish. A draft agent can't be dispatched. After saving, click Publish.
Any sandbox agent
1. Sandbox a custom Python agent
Goal: run your own Python code in a platform sandbox, end to end, with no repo and no tools: register, seed one task, and read the trajectory and verdict.
Step 1: Register. In the sidebar, Agents → Register agent, pick the
Sandbox Agents mode card. Set a Name (e.g. hello-sandbox). Under
How your agent runs, select Python function.
Step 2: Paste the code. In the Code source picker choose Paste code and paste the following module:
```python
import platform
import subprocess
from pathlib import Path


def run(task_input, *, proxy_url, run_token):
    instruction = task_input["user_instruction"]
    # Do some real work so the trajectory has steps to show.
    Path("/tmp/notes.txt").write_text(f"task: {instruction}\n")
    uname = subprocess.run(["uname", "-a"], capture_output=True, text=True)
    return {
        "final_response": (
            f"Hello from the sandbox (Python {platform.python_version()}). "
            f"You asked: {instruction!r}. Kernel: {uname.stdout.strip()}"
        )
    }
```
The contract: a top-level callable taking (task_input, *, proxy_url, run_token), returning {"final_response": ...} (or a plain string, which the
platform wraps). An unhandled exception is captured and graded; it doesn't
count as an infra failure.
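Before pasting, it can help to exercise the contract locally. The sketch below is a hypothetical local harness, not platform code: `normalize_result`, `invoke_locally`, and `echo_agent` are illustrative names, the proxy values are placeholders, and the wrapping and exception handling only approximate the behavior described above.

```python
def normalize_result(result):
    """Wrap a plain string as the contract describes; pass valid dicts through."""
    if isinstance(result, str):
        return {"final_response": result}
    if isinstance(result, dict) and "final_response" in result:
        return result
    raise TypeError("run() must return a str or a dict with 'final_response'")


def invoke_locally(run, instruction):
    """Call an agent entrypoint with placeholder proxy values (local use only)."""
    task_input = {"user_instruction": instruction}
    try:
        result = run(task_input, proxy_url="http://localhost:0", run_token="local-test")
    except Exception as exc:
        # Assumption: approximates "captured and graded" by recording the error.
        return {"final_response": None, "error": repr(exc)}
    return normalize_result(result)


def echo_agent(task_input, *, proxy_url, run_token):
    # Returning a bare string exercises the wrapping path.
    return f"You asked: {task_input['user_instruction']}"
```

Calling `invoke_locally(echo_agent, "ping")` returns a dict with a `final_response` key, which is the shape the judge-facing surfaces expect.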
The sandbox cannot import the Pipelines SDK, so write SDK-free code. A
real agent calls your model provider directly: declare the package (e.g.
anthropic) under Sandbox environment (advanced) → Python requirements
and the API key as an Environment variable → From credential (runbook 6
covers the environment; the credential path is the only encrypted one).
Step 3: Entrypoint. Leave Entrypoint as run (it must name a
top-level callable; dotted paths are rejected). Save, then Publish.
Step 4: Wire and seed. In the Pipeline Builder, add an agent-mode
field and select hello-sandbox. In the field's Models popover pick a
Judge model (skip if your org default is set). The seed's user column is
always required; create one task with this CSV:
```csv
user
"Summarize what environment you are running in."
```
Verify: open the task from the Data Explorer (View trace). The run completes, and the Agent Trace tab shows the final response, a Judge verdict (PASS/FAIL with rubric reasoning), and a Trajectory that fills in after the run completes; for this example, a file write and a shell step. No diff and no scorer badges: there's no workspace by design.
If it fails
| Error | Cause | Fix |
|---|---|---|
| FAILED agent_model_unresolved | No judge model is resolvable for the run. | Pick one on the field's Models popover or set the org default. |
| Trajectory is empty | Capture is best-effort and keys off real syscalls like subprocess, open, and httpx; pure compute returns can produce few steps. | Trigger at least one real filesystem, process, or network action in the agent path you are testing. |
| You expected a diff | Plain sandbox runs have no workspace attached. | Attach a coding scenario (runbook 7). |
2. Ship agent code as a ZIP
Goal: Run a multi-file agent when inline paste limits are exceeded (200 KB
for pasted code, 1 MB across Multiple files) or when non-.py assets are
required.
You'll need: A .zip archive of your agent (≤ 100 MB compressed, ≤ 500 MB
uncompressed) and Org Admin permissions in the agent's organization.
Archive uploads are organization-scoped and are not granted by project-level
roles.
Step 1: Structure the archive. A single top-level folder is flattened on
unzip, so both of these land the same way in /home/user/agent:
```text
my-agent.zip              my-agent.zip
├── main.py               └── my-agent/
└── helpers/                  ├── main.py
    └── prompts.py            └── helpers/
                                  └── prompts.py
```
main.py defines the entrypoint, exactly as in runbook 1:
```python
from helpers.prompts import SYSTEM_PROMPT


def run(task_input, *, proxy_url, run_token):
    return {"final_response": f"{SYSTEM_PROMPT}: {task_input['user_instruction']}"}
```
Step 2: Upload. Register agent → Sandbox Agents → Python function, then in the Code source picker choose Upload ZIP and drop the file. The form uploads it to storage with a short-lived signed URL and blocks submit until the upload confirms; only the confirmed archive id is saved on the agent, never the bytes.
Step 3: Entrypoint. Set Entrypoint to run and Entrypoint file
to main.py (the .py inside the zip that defines it). Save and Publish.
The zip's contents aren't known at save time, so Entrypoint file is only shape-checked on save and verified inside the sandbox at dispatch; a wrong path fails the run, not the save.
Step 4: Seed and dispatch exactly as in runbook 1, step 4.
Verify: at dispatch the archive is validated (size caps, zip-slip and
symlink guards) and unzipped into the agent directory /home/user/agent,
never into the graded workspace. The run produces a trajectory just like a
pasted-code run. Fetch and validation happen before a sandbox boots, so a
bad archive costs nothing.
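A local pre-flight check can catch most of these rejections before upload. The sketch below is an approximation under stated assumptions, not the platform's validator: `preflight_zip` is a hypothetical helper, the size caps are the ones quoted in this runbook, and the zip-slip and symlink tests are common heuristics rather than the exact server-side rules.

```python
import io
import stat
import zipfile

MAX_COMPRESSED = 100 * 1024 * 1024    # 100 MB cap quoted in this runbook
MAX_UNCOMPRESSED = 500 * 1024 * 1024  # 500 MB cap quoted in this runbook


def preflight_zip(data: bytes) -> list[str]:
    """Return a list of problems found in a zip archive given as bytes."""
    problems = []
    if len(data) > MAX_COMPRESSED:
        problems.append("compressed size exceeds 100 MB")
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
        if sum(info.file_size for info in infos) > MAX_UNCOMPRESSED:
            problems.append("uncompressed size exceeds 500 MB")
        for info in infos:
            parts = info.filename.replace("\\", "/").split("/")
            # Zip-slip heuristic: absolute paths or ".." segments can escape the target.
            if info.filename.startswith("/") or ".." in parts:
                problems.append(f"unsafe path: {info.filename}")
            # Unix file modes live in the upper 16 bits of external_attr.
            if stat.S_ISLNK(info.external_attr >> 16):
                problems.append(f"symlink entry: {info.filename}")
    return problems
```

An empty list means the archive passed these local checks; the platform may still apply rules this sketch does not cover.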
If it fails
| Error | Cause | Fix |
|---|---|---|
| FAILED agent_code_fetch_failed before any sandbox | The file is unconfirmed, cross-org, or failed archive validation (size, zip-slip, or symlink checks). | Re-upload the archive into the agent's organization and wait for upload confirmation. |
| Run fails on the entrypoint probe | The configured Entrypoint file path does not match a valid path inside the zip. | Correct the path, accounting for top-level folder flattening, then run again. |
| Register stays disabled | The upload is still in progress. | Wait for upload confirmation and avoid reloading during upload. |
3. Pull agent code from a private git repo
Goal: Clone agent source code from a private repository at dispatch time, while ensuring the token never appears in output, logs, or stored configuration.
You'll need: An https clone URL and a PAT, or an organization credential that stores the PAT.
Step 1: Point at the repo. Register agent → Sandbox Agents → Python
function, Code source picker → Git repository. Enter the
Repository URL (https only; SSH URLs and user:pass@ userinfo are
rejected, and an SSRF guard blocks private/loopback hosts) and, if needed, a
Ref (optional): a branch, tag, or commit SHA, e.g. v1.2.0.
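The URL rules above can be approximated client-side before saving. This is a rough sketch, not the platform's validator: `check_repo_url` is a hypothetical helper, it only inspects literal IP hosts and `localhost`, and it does not resolve DNS names the way a real SSRF guard would.

```python
import ipaddress
from urllib.parse import urlsplit


def check_repo_url(url: str) -> list[str]:
    """Return reasons a clone URL would likely be rejected (empty if none found)."""
    reasons = []
    parts = urlsplit(url)
    if parts.scheme != "https":
        reasons.append("scheme must be https")
    if parts.username or parts.password:
        reasons.append("embedded userinfo is not allowed")
    host = parts.hostname or ""
    if not host:
        reasons.append("missing host")
    elif host == "localhost":
        reasons.append("loopback host")
    else:
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            ip = None  # a DNS name; this sketch does not resolve it
        if ip is not None and (ip.is_private or ip.is_loopback):
            reasons.append("private or loopback address")
    return reasons
```

For example, `check_repo_url("https://example.com/org/repo.git")` returns an empty list, while an `ssh://` URL or one carrying `user:pass@` returns the matching reason.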
Step 2: Auth. Pick one of the three Auth modes:
- None: public repo.
- From credential: an existing org credential, decrypted server-side at dispatch.
- Inline token: paste the PAT once. It's write-only: stored in a hidden platform-managed credential, never echoed back (edit mode shows rotate-or-keep, never the token).
Step 3: Entrypoint. Set Entrypoint (run) and Entrypoint file
(the module path inside the repo, default main.py). Save, Publish, then
seed and dispatch as in runbook 1, step 4.
Verify: the PAT is decrypted only when the clone command is built,
injected into the clone URL, and the remote is dropped right after the clone.
It's registered as a run secret, so it renders as *** on every trajectory
and final-response surface, even if your agent echoes it. The clone resolves
before a sandbox boots; on later turns of a multi-turn session the
populated sandbox is reused without re-cloning.
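To make the masking behavior concrete, here is a minimal illustration of run-secret redaction. It is not the platform's implementation: `redact` is a hypothetical function, and the only detail taken from this runbook is the `***` mask.

```python
def redact(text: str, secrets: list[str], mask: str = "***") -> str:
    """Replace every occurrence of each registered secret with the mask."""
    # Mask longer secrets first so a secret that contains another is fully hidden.
    for secret in sorted((s for s in secrets if s), key=len, reverse=True):
        text = text.replace(secret, mask)
    return text
```

With this sketch, an agent reply that echoes the token, such as `"token is ghp_abc123"`, would be rendered as `"token is ***"` when `"ghp_abc123"` is in the secrets list.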
If it fails
| Error | Cause | Fix |
|---|---|---|
| FAILED agent_code_fetch_failed | Agent code checkout is strict, so a missing branch, tag, or SHA is treated as a hard error. Clone failures can also come from an invalid URL or PAT. | Correct the Ref value, or fix the repository URL or PAT if clone authentication failed. |
| FAILED agent_secret_unresolved | The git credential is missing or cannot be decrypted. | Re-bind From credential or re-paste the Inline token. |
| Save rejected | The configuration uses a non-https URL, includes embedded user:pass@, or sets both a credential and an inline token. | Use https only and configure exactly one authentication method. |
4. Give a sandboxed agent platform tools
Goal: Enable a sandbox agent to call platform tools, either simulated against seeded world state or passed through to a registered endpoint, with each call recorded as a ledger row.
You'll need: A registered sandbox agent (runbooks 1–3) or a coding CLI agent (runbook 7).
Step 1: Declare the tools. On the agent form's Tools step, add each
tool's name and input_schema and pick its execution mode: sandbox
(the platform simulates the response from world state) or passthrough
(forwarded verbatim to a bound endpoint). Or import in bulk: Import JSON
for a raw array, Import from MCP to discover a connected server's catalog.
A minimal sandbox-mode tool:
```json
{
  "name": "get_order",
  "description": "Look up an order by id.",
  "input_schema": {
    "type": "object",
    "properties": { "order_id": { "type": "string" } },
    "required": ["order_id"]
  }
}
```
Field rules and passthrough bindings: Tools schema.
Step 2: Call them from a Python-function agent. No SDK, no MCP, raw
HTTP against the per-run proxy. Declare httpx under Python requirements
and paste:
```python
import os

import httpx


def call_tool(name, args):
    # POST the tool arguments to the per-run proxy, authenticated with the run token.
    r = httpx.post(
        f"{os.environ['PIPELINES_ODYSSEY_PROXY_URL']}/tools/{name}",
        headers={"Authorization": f"Bearer {os.environ['PIPELINES_RUN_TOKEN']}"},
        json=args,
        timeout=60,
    )
    r.raise_for_status()
    return r.json()


def run(task_input, *, proxy_url, run_token):
    order = call_tool("get_order", {"order_id": "4521"})
    return {"final_response": f"Order status: {order}"}
```
(proxy_url / run_token arrive both as kwargs and as the PIPELINES_* env
vars; use either.)
Step 2′: Or let a CLI harness reach them. A shell-command agent (runbook
7) gets the same tools through the pipelines MCP shim, which is auto-registered
into the harness when tools are attached, so there is nothing to wire. Claude
Code, Codex, and Cursor are supported; details in MCP tools and Harness
customization.
Verify: run a task and open its trace. Each tool call is a ledger row with the arguments, the response, and a source badge (simulated, passthrough, injected when a failure rule fired, or error), plus a step-by-step world-state diff.
If it fails
| Error | Cause | Fix |
|---|---|---|
| tool_not_executable on a coding run | Coding runs do not silently simulate tool execution. | Bind the tool to a real passthrough endpoint, or have the agent perform the work directly in the repository. |
| 503 lock_contention on parallel calls | Tool calls are serialized per run and concurrent calls contend on the same lock. | Retry the call after contention clears. |
| Tools dropped with a warning at dispatch | The run uses Aider (no MCP support) or an unrecognized run command. | Use Claude Code, Codex, or Cursor, or call the proxy directly from your own code as in Step 2. |
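The retry suggested for 503 lock_contention can be sketched as a small backoff wrapper. This is an illustration only: `call_with_retry` is a hypothetical helper, and the attempt count and delays are assumptions, since this runbook says only to retry after contention clears.

```python
import time


def call_with_retry(send, *, attempts=5, base_delay=0.2, sleep=time.sleep):
    """Retry `send()` while it reports 503; return the last (status, body).

    `send` is any zero-argument callable returning (status_code, body).
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ... (assumed policy).
            sleep(base_delay * (2 ** attempt))
    return status, body
```

In an agent, `send` would wrap the HTTP POST to the proxy and return the response's status code and parsed body; keeping the transport outside the helper makes the retry logic easy to test without a network.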
5. Run a multi-turn conversation
Goal: Run a simulated user through multiple turns against the agent within one persistent session, and obtain a session-level verdict.
You'll need: A registered agent with healthy one-shot runs (runbook 1).
Step 1: Configure the simulator. On the agent field, open the Multi-turn popover and set:
- simulator_mode: persona (an LLM plays the user; give it a short persona describing goals, tone, escalation) or scripted (a fixed JSON array of user turns, replayed in order).
- max_turns: hard cap, 1..50 (default 10; 6–10 is a good start).
- memory_mode: replay (default; the platform re-sends the transcript in input.messages each turn) or stateful (your agent keeps its own memory, keyed by input.session_id).
The popover sets turn_mode = model_as_user on the seed for you. To vary any
of these per row instead, opt each column in under the popover's Vary per
row from CSV section. A wired column becomes a required CSV column at
upload:
```csv
user,turn_mode,max_turns,simulator_mode,memory_mode,user_simulator_persona
"Help me fix my failed refund for order #4521.","model_as_user","8","persona","replay","Frustrated but cooperative; expects the agent to remember order details and not repeat verification."
```
Step 2: Dispatch and read. Run the task and open it: multi-turn rows show the canonical transcript across turns, a per-turn trajectory, and the session-level judge verdict over the whole conversation.
Verify: the session terminates when the simulator signals done,
max_turns is hit, or a termination_keyword appears in an agent reply. On a
coding session (runbook 7 + this one) the sandbox and repo are seeded once
at turn 0 and carried forward; each turn's diff is cumulative against the
original baseline, and the session judge sees the cumulative diff alongside
the transcript. Full axis reference:
Multi-turn testing.
If it fails
| Error | Cause | Fix |
|---|---|---|
| Session FAILED simulator_model_unresolved | No user-simulator model is configured for the run. | Pick one on the field's Models popover or set the org default. |
| Ran single-shot despite the popover | A turn_mode typo (multi_turn, model_as_users) or a stale seed causes fallback to one-shot mode. | Use the exact value model_as_user, then re-check column mapping and re-seed. |
| Session ends immediately | simulator_mode = scripted is set with an empty or non-JSON-array scripted_user_turns value. | Provide scripted_user_turns as a JSON array of strings. |
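The scripted-turns failure in the last row can be caught before upload by checking the cell locally. A minimal sketch, assuming the cell must hold a non-empty JSON array of non-empty strings; `validate_scripted_turns` is a hypothetical helper, not a platform API.

```python
import json


def validate_scripted_turns(cell: str) -> list[str]:
    """Parse a scripted_user_turns cell; raise ValueError if it is unusable."""
    try:
        turns = json.loads(cell)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(turns, list) or not turns:
        raise ValueError("expected a non-empty JSON array")
    if not all(isinstance(turn, str) and turn.strip() for turn in turns):
        raise ValueError("every turn must be a non-empty string")
    return turns
```

Passing `'["Hi", "Where is my refund?"]'` returns the two turns as a list, while an empty cell, `[]`, or a bare string raises `ValueError`.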
6. Customize the sandbox runtime
Goal: add system packages, Python deps, a boot step, or a persistent
custom image to any code agent. The default image is Python 3.13 with git,
ripgrep, unzip, uv, and pytest; start here only if that's not enough.
You'll need: a registered sandbox agent (any of runbooks 1–3, 7).
Step 1: Boot-time layering (per-run, no image build). Open Sandbox environment (advanced) and set any of:
- System packages (one per line): apt packages (≤ 50), installed as root once at sandbox boot, before your agent.
- Setup command: one shell command run at sandbox start, after package installs, with your resolved env (so it can use a credential-backed token). A nonzero exit fails the run.
- Python requirements (one per line) and Python version (3.9–3.13; blank = 3.13): Python-function agents only; pip-installed before your agent runs.
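The limits in this list are easy to check before saving the form. The helper below is a hypothetical sketch (`check_environment` is not a platform function) that encodes only the two numeric rules stated above: at most 50 system packages, and a Python version between 3.9 and 3.13 with blank meaning 3.13.

```python
def check_environment(system_packages: str, python_version: str = "") -> list[str]:
    """Return problems with the System packages and Python version fields."""
    problems = []
    packages = [line.strip() for line in system_packages.splitlines() if line.strip()]
    if len(packages) > 50:
        problems.append(f"{len(packages)} system packages listed; the cap is 50")
    version = python_version.strip() or "3.13"  # blank means 3.13
    try:
        major, minor = (int(part) for part in version.split("."))
    except ValueError:
        problems.append(f"unparseable Python version: {version!r}")
    else:
        if not (3, 9) <= (major, minor) <= (3, 13):
            problems.append(f"Python {version} is outside 3.9 to 3.13")
    return problems
```

An empty result means both fields are within the documented limits; package names themselves are not validated here.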
Step 2: Persistent image (for heavier tooling). Switch Base image to
Custom Dockerfile and write only the body; the platform prepends
FROM pipelines-workspace-base:
```dockerfile
RUN sudo apt-get update && sudo apt-get install -y jq
RUN sudo npm install -g @anthropic-ai/claude-code
ENV NODE_OPTIONS=--max-old-space-size=4096
```
Constraints: only RUN / ENV / WORKDIR (no COPY/ADD, since there's no
build context, and no second FROM, ENTRYPOINT, CMD, or USER); ≤ 32 KB
of text. Two gotchas that account for most build failures:
- The build runs as a normal user, so global installs (npm install -g, system pip, apt-get) need sudo (passwordless in the base image).
- The base is Python 3.13, so Python CLIs pinning old deps won't build there. Use uv tool install --python 3.11 <tool> and call the binary by full path (/home/user/.local/bin/<tool>).
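The Dockerfile body constraints can be linted locally before a build is requested. This is a simplified sketch under stated assumptions: `lint_dockerfile_body` is a hypothetical helper, it treats the first word of each logical line as the instruction, and it handles backslash line continuations but not heredocs or parser directives.

```python
ALLOWED = {"RUN", "ENV", "WORKDIR"}
MAX_BYTES = 32 * 1024  # 32 KB cap quoted in this runbook


def lint_dockerfile_body(body: str) -> list[str]:
    """Return problems found in a custom Dockerfile body."""
    problems = []
    if len(body.encode("utf-8")) > MAX_BYTES:
        problems.append("body exceeds 32 KB")
    continued = False  # True while inside a backslash-continued instruction
    for number, raw in enumerate(body.splitlines(), start=1):
        line = raw.strip()
        if continued:
            if not line.startswith("#"):
                continued = line.endswith("\\")
            continue
        if not line or line.startswith("#"):
            continue
        instruction = line.split(None, 1)[0].upper()
        if instruction not in ALLOWED:
            problems.append(f"line {number}: {instruction} is not allowed")
        continued = line.endswith("\\")
    return problems
```

A body containing `FROM` or `COPY` lines is reported with the offending line numbers, while the three-line example in Step 2 passes cleanly.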
Step 3: Build. Leave Build image now checked when saving (or click Build image on the agent detail page; building is always an explicit action). The Custom image card shows the chip: Not built → Building… (live log) → Ready (or Build failed with the log tail and a Rebuild button). An identical Dockerfile already built in your org is reused without rebuilding.
Verify: wait for Ready, then dispatch. Runs with a custom Dockerfile are pinned to the built image: while Building… or Build failed, dispatch is a hard error, never a silent fallback to the default image.
If it fails
| Error | Cause | Fix |
|---|---|---|
| FAILED environment_setup_failed | A System packages install step or the Setup command exited nonzero. | Fix the failing command, remembering it runs as the default user and requires sudo for system-level changes. |
| Chip shows Build failed | The Dockerfile build failed, commonly due to missing sudo or a Python 3.13-incompatible package install. | Open the failure log on the agent page, correct the Dockerfile, then click Rebuild. |
| FAILED image_not_ready / image_build_failed | Dispatch was attempted before the image reached Ready status. | Build or rebuild the image first, then run again. |
| Build request returns 409 | Only one build can run at a time per organization. | Wait for the in-flight build to complete, then retry. |
Coding agents
7. Run a coding CLI agent on a repo task
Goal: register Claude Code (or Codex / Cursor / Aider) as an agent, point it at a seeded repo, and read the trajectory, final diff, scorer badges, and verdict.
You'll need: a model-provider API key stored as an org credential, and a repo for the agent to work on (any public repo works for a first run).
Step 1: Register. Agents → Register agent → Sandbox Agents. Under How your agent runs, pick Shell command (any CLI).
Step 2: Run command. Click the Claude Code preset chip; it fills in:
```shell
claude -p "$(cat $PIPELINES_TASK_FILE)"
```
The platform writes the task brief to $PIPELINES_TASK_FILE and, for a
recognized CLI, appends the headless/approval flags for you (--print for
Claude Code, --yes-always for Aider, --force for Cursor). Call the binary
directly: wrapping the command in bash -c "…" is the one form that isn't
recognized, and it costs you the rich trajectory.
Step 3: Install the CLI in the image. The base image ships no coding CLIs. Under Sandbox environment (advanced) → Base image → Custom Dockerfile, add the install line and leave Build image now checked:
| CLI | Install line |
|---|---|
| Claude Code | RUN sudo npm install -g @anthropic-ai/claude-code |
| Codex | RUN sudo npm install -g @openai/codex |
| Cursor | RUN curl https://cursor.com/install -fsSL \| bash |
| Aider | RUN uv tool install --python 3.11 aider-chat |
Step 4: Key. Add an Environment variable row, mode From
credential (the only encrypted path): ANTHROPIC_API_KEY for Claude Code,
OPENAI_API_KEY for Codex, CURSOR_API_KEY for Cursor, the --model
provider's key for Aider. Also raise the Run timeout; the 300 s default
is tight for coding runs (max 1800 s). Save and Publish; wait for the
Custom image chip to reach Ready.
Step 5: Attach a coding setup and seed. A shell-command agent always needs a workspace. In the Pipeline Builder, select the agent on an agent-mode field and open the field's Coding setup popover: pick the Git URL workspace tile and enter your repo's URL, add an optional Setup command (runs before the baseline commit, so it stays out of the graded diff), and Scorers rows if you want mechanical gates. Pick a Judge model, then seed one task:
```csv
user
"Fix the failing test in tests/test_parser.py and make the suite pass."
```
Verify: the platform seeds the repo at /home/user/workspace, commits a
baseline, runs your command, and the task page shows all four surfaces: a
Trajectory timeline (Shell steps with output and exit codes, Edit steps as
red/green diffs, Read/Search, Assistant reasoning; it fills in after the
run completes), the Final diff against the baseline, scorer badges,
and the Judge verdict. Full UI tour: Inspecting runs.
If it fails
| Error | Cause | Fix |
|---|---|---|
| FAILED agent_command_failed (in the error detail) | The command exited nonzero and left an empty diff, typically because the CLI waited on an interactive prompt. | Use a recognized binary so headless flags are appended automatically, or pass equivalent headless flags manually. (A nonzero exit with a real diff is not treated as failure; scorers and the judge determine outcome.) |
| FAILED in_sandbox_requires_workspace | No coding scenario is attached to the task. | Attach a coding scenario via the Coding setup popover (Step 5), then re-seed. |
| FAILED image_not_ready | The custom image is still building. | Wait for Ready status (Step 4), then run again. |
| Trajectory empty but diff present | An unrecognized run command (for example, bash -c "…") triggered coarse fallback capture. | Invoke the CLI binary directly. |
| Banner "Eval phase failed, diff/scorers may be missing" | Diff or scorer capture failed, but the run itself completed and trajectory data is still available. | Expand the banner to inspect the redacted failure cause and re-run if needed. |
8. Seed many coding tasks from CSV
Goal: define the repo, setup, and scorers once on the agent field, then fan a CSV of prompts out into many tasks, each frozen against that definition.
You'll need: a coding agent (runbook 7) and a CSV of task prompts.
Step 1: Define the coding setup. On the agent-mode field, open the Coding setup popover: pick the Workspace source (Git URL / Upload ZIP / Empty), an optional Setup, and Scorers rows. The definition saves onto the field, and every task seeded from the workflow starts from it.
The two Setup modes differ in whether the setup work is graded. The "Platform runs a command" mode executes the Setup command before the baseline commit, so installs and fixtures never pollute the agent's diff. The "Agent sets up" mode sends the Setup instructions to the agent, whose setup work does land in the graded diff. Use platform mode unless setting up is itself the task.
On the workspace seed path, a bad git Ref silently falls back to the repo's default branch: the run proceeds with no error (unlike the strict agent-code checkout of runbook 3). Double-check branch/tag/SHA spelling.
Step 2: Upload the CSV. The CSV carries the field's wired seed columns;
user is always required, with one prompt per row:
```csv
user
"Fix the failing test in tests/test_parser.py."
"Add a --json flag to the CLI entrypoint."
"Refactor parse() to return a dataclass without changing behavior."
```
Each row becomes one task carrying a deep-copied, frozen snapshot of the field's coding setup at creation time. Editing the setup later never changes already-seeded tasks; re-seed to pick up edits.
Step 3: Per-row scenario control (API only). A saved, org-scoped
scenario library lives at /api/coding-scenarios (names unique per org), and
the seeding service accepts per-row CSV axes on top of it: scenario_ref
picks a saved scenario by name per row, and workspace_seed /
setup_command / eval override it per row (a row cell always wins; the
scenario fills gaps). These columns flow only when the workflow's agent field
wires them in agentConfig.odyssey_seed_columns via the workflow API.
The builder UI doesn't expose toggles for them; its UI path is the Coding
setup popover of Step 1.
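The precedence rule (a row cell wins, the scenario fills gaps) can be pictured as a dictionary merge. The sketch below is illustrative only: `resolve_row` is a hypothetical function, scenarios are modeled as plain dicts of strings, and treating a blank setup_command as absent is an assumption. Rejecting blank scenario_ref, workspace_seed, and eval cells mirrors the hard-error behavior this runbook describes for the API path.

```python
AXES = ("workspace_seed", "setup_command", "eval")
STRICT = ("scenario_ref", "workspace_seed", "eval")  # blank cells are hard errors


def resolve_row(row: dict, scenarios: dict) -> dict:
    """Merge one CSV row over its referenced scenario; row cells win."""
    for column in STRICT:
        if column in row and not str(row[column]).strip():
            raise ValueError(f"{column} is present but blank")
    resolved = {}
    ref = row.get("scenario_ref")
    if ref is not None:
        if ref not in scenarios:
            raise KeyError(f"unknown scenario_ref: {ref}")
        # The scenario supplies defaults for every axis it defines.
        resolved.update({k: v for k, v in scenarios[ref].items() if k in AXES})
    # Non-blank row cells override the scenario (assumed: blank setup_command is skipped).
    resolved.update({k: row[k] for k in AXES if str(row.get(k, "")).strip()})
    return resolved
```

A row that names a scenario and sets only setup_command keeps the scenario's workspace_seed and eval while using its own setup command.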
Verify: one task per CSV row appears in the Data Explorer, each with the frozen workspace seed, setup, and scorers; open any row to confirm the seed snapshot on the trace tab. Full axis semantics: Coding scenarios.
If it fails
| Error | Cause | Fix |
|---|---|---|
| A row fails with no task created (API path) | A present-but-blank or malformed scenario_ref, workspace_seed, or eval cell triggers a hard-error axis. Coding tasks are not silently downgraded to non-workspace runs. | Correct the malformed cell. Omitting the entire column is valid if that axis is not needed. |
| ScenarioResolutionError (API path) | The scenario_ref value is unknown, cross-org, or archived. | Use the exact name of a live scenario in the task's organization. Use GET /api/coding-scenarios?include_archived=true to locate archived entries. |
| POST /api/coding-scenarios returns 409 | Scenario name collision occurred because names are unique per organization. | Rename the scenario and retry creation. |
| FAILED workspace_seed_failed | The repository could not be cloned or unzipped in the sandbox due to bad URL/host, disallowed scheme, invalid Subdirectory (optional), or archive validation failure. | Correct the workspace seed configuration and run again. |
| FAILED workspace_setup_failed | The platform-mode Setup command failed after repository seeding. | Reproduce the setup command locally against the same repository, fix the failure, then re-run. |