Scorers and grading

A coding run is graded immediately after agent execution stops. Unlike a ledger run, which is evaluated only by the judge, a run with a code workspace is also evaluated by scorers. Scorers are deterministic programs that run in the live sandbox and inspect the actual repository changes.

How a coding run is graded

agent finishes
  → scorers run in the live sandbox against the workspace
  → the LLM judge scores the run, reading the trajectory, the diff,
    and the scorer results as evidence
  → results land on the run

The agent finishes. Trajectory and final response are saved, and the platform captures the cumulative git diff of the workspace versus the seeded baseline.
Scorers run. Each scorer executes in the still-live sandbox and, depending on its type, runs a test command, reads the diff, or checks the changed files. Scorers run on the full working tree.
The judge scores the run. It reads the task, trajectory, diff, test results, and scorer results, then returns rubric scores and a PASS or FAIL verdict.
Results are written to the run and shown as three independent gates.

Scorers run only for runs with a code workspace. Ledger and no-workspace runs do not enter the scorer engine and are evaluated only by the judge.
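The routing described above can be sketched in a few lines. This is an illustrative model only: the Run class, the grade function, and the run_scorers and run_judge callables are hypothetical names, not the platform's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Run:
    """Hypothetical stand-in for a run record."""
    has_code_workspace: bool
    scorer_results: Optional[list] = None  # None: scorer engine was never entered
    judge_verdict: Optional[str] = None


def grade(run: Run,
          run_scorers: Callable[[Run], list],
          run_judge: Callable[[Run, Optional[list]], str]) -> Run:
    # Scorers run first, and only when the run has a code workspace.
    if run.has_code_workspace:
        run.scorer_results = run_scorers(run)
    # The judge always runs; scorer results are extra evidence when present.
    run.judge_verdict = run_judge(run, run.scorer_results)
    return run
```

A ledger run passed through this sketch keeps scorer_results as None and still receives a judge verdict.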

The scorer catalog

A scorer has a name, a type, a required flag, and type-specific fields. Each scorer yields a per-scorer PASS, FAIL, or N/A verdict and a score from 0.0 to 1.0. A scorer that does not apply, for example a test-content check when no test files changed, returns N/A and neither passes nor fails.

Type	Detects	Config	Required by default
Command (command)	Shell command exits non-zero. Primary test command is typically this type.	command (string), timeout_s (1 to 3600, default 900)	required
Allowed paths (allowed_paths)	Changed file is outside allowed glob patterns.	patterns (non-empty glob list)	required
Forbid paths (forbid_paths)	Changed file matches forbidden glob patterns, for example .github or secrets paths.	patterns (non-empty glob list)	required
Max files changed (max_files_changed)	Number of changed files exceeds configured limit.	limit (non-negative integer)	required
Forbid secrets (forbid_secrets)	Secret-like pattern appears in diff. This is a tripwire, not a complete scanner.	none	required
File exists (file_exists)	Required relative path is missing after run.	path (relative path, no parent traversal)	required
Tests unmodified (tests_unmodified)	Graded test file was edited.	paths (non-empty list of exact paths)	required
Baseline unmodified (baseline_unmodified)	Baseline or scaffolding file was edited.	paths (non-empty list of exact paths)	required
No new skips (no_new_skips)	Net new skip or xfail markers were added to test files.	test_globset (optional glob list)	advisory
Assertions not weakened (assertions_not_weakened)	Assertions were net-removed from test files.	test_globset (optional glob list)	advisory
LLM judge (llm_judge)	Non-mechanical rubric note passed to judge as additional context.	rubric (string)	advisory only
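As an illustration of the Config column, the sketch below checks a few of the constraints listed in the table. The dict layout and the validate_scorer function are assumptions made for this example; the platform's own storage format and validation are not documented here.

```python
from pathlib import PurePosixPath


def validate_scorer(cfg: dict) -> dict:
    """Return a normalized copy of a scorer config, or raise ValueError.

    Hypothetical helper: the field names follow the catalog table,
    but the dict shape itself is an assumption.
    """
    out = dict(cfg)
    kind = out.get("type")
    if kind == "command":
        if not isinstance(out.get("command"), str):
            raise ValueError("command must be a string")
        timeout = out.setdefault("timeout_s", 900)  # table default
        if not isinstance(timeout, int) or not 1 <= timeout <= 3600:
            raise ValueError("timeout_s must be between 1 and 3600")
    elif kind in ("allowed_paths", "forbid_paths"):
        patterns = out.get("patterns")
        if not isinstance(patterns, list) or not patterns:
            raise ValueError("patterns must be a non-empty glob list")
    elif kind == "max_files_changed":
        limit = out.get("limit")
        if isinstance(limit, bool) or not isinstance(limit, int) or limit < 0:
            raise ValueError("limit must be a non-negative integer")
    elif kind == "file_exists":
        raw = out.get("path")
        if not isinstance(raw, str):
            raise ValueError("path must be a string")
        path = PurePosixPath(raw)
        # Relative path only, with no parent traversal.
        if not path.parts or path.is_absolute() or ".." in path.parts:
            raise ValueError("path must be relative without '..'")
    return out
```

For example, a command scorer given without timeout_s comes back with the default of 900, while a file_exists path such as ../secrets is rejected.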

Path patterns use shell-style fnmatch globs, not gitignore syntax. A single wildcard does not stop at slash boundaries, so src/* also matches src/deep/file.py. Write src/** when you want the pattern to state explicitly that it covers all descendants of src.
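The slash behavior is easy to check with Python's standard fnmatch module. This assumes the platform's matching behaves like Python's fnmatch, which the text implies but does not state; matches_any is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def matches_any(path: str, patterns: list[str]) -> bool:
    """True if the path matches at least one fnmatch-style pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


# '*' crosses '/' boundaries under fnmatch semantics.
print(matches_any("src/deep/file.py", ["src/*"]))   # True
print(matches_any("src/deep/file.py", ["src/**"]))  # True
print(matches_any("docs/readme.md", ["src/*"]))     # False
# Unlike gitignore, a bare '*.py' also matches files in subdirectories.
print(matches_any("src/app.py", ["*.py"]))          # True
```

Because the wildcard crosses directory boundaries, a pattern intended for one directory level can allow or forbid more files than expected.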

Reward-hacking detectors

Four scorers specifically detect reward hacking, where the agent modifies the grading checks instead of solving the task:

Tests unmodified, fails if the agent edits a graded test file.
Baseline unmodified, fails if the agent edits frozen scaffolding.
No new skips, fails if the agent adds skip or xfail markers to avoid failing tests.
Assertions not weakened, fails if assertions are removed from test bodies.

The two heuristic detectors, No new skips and Assertions not weakened, evaluate line diffs on changed test files. By design they are approximate, so they default to advisory: their signals are visible without failing the run unless the scorer is explicitly marked required. Tests unmodified and Baseline unmodified rely on exact path checks against the changed-file list and therefore default to required. The scope scorers (Allowed paths, Forbid paths, and Max files changed) also constrain where and how much an agent may modify.

Required vs advisory

The required flag determines whether a scorer's failure affects the workspace verdict:

Required scorers gate the verdict. If any required scorer that applies to the run fails, the workspace verdict is FAIL.
Advisory scorers still execute and are reported with an advisory tag, but their failure does not fail the run.

Only No new skips and Assertions not weakened default to advisory. All other mechanical scorers default to required. For llm_judge scorers, the required flag does not gate the verdict; it only affects the context passed to the judge.
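The gating rule can be expressed as a small function. This is a sketch under stated assumptions: the result dict keys and the workspace_verdict name are hypothetical, and returning PASS when no required scorer applies is an assumption that this page does not confirm.

```python
def workspace_verdict(results: list[dict]) -> str:
    """Derive the workspace verdict from per-scorer results.

    Each result is assumed to look like:
    {"type": "command", "required": True, "verdict": "PASS" | "FAIL" | "N/A"}
    """
    for result in results:
        if result["type"] == "llm_judge":
            continue  # never gates, regardless of its required flag
        if result["required"] and result["verdict"] == "FAIL":
            return "FAIL"  # a required, applicable scorer failed
    # N/A and advisory failures fall through without failing the verdict.
    return "PASS"
```

An advisory FAIL or a required N/A leaves the verdict at PASS in this model; only a required FAIL flips it.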

Configuring scorers

Scorers are configured on a coding scenario, not per task. In the scenario editor, each scorer is one row with:

Name, a label for the scorer.
Type, selected from the catalog. Each type exposes its own configuration fields, including command and timeout, patterns, limit, path, paths, rubric, and optional test glob set as appropriate.
Required toggle, which controls required versus advisory behavior and initializes from type default.

Scorers freeze at seed time. When a task is created, the scenario's scorers are snapshotted into the task seed. Later scenario edits do not alter the grading of already-seeded tasks. To grade with updated scorers, create new tasks. See Coding scenarios.
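The snapshot behavior amounts to copying the scorer list by value when a task is created. The sketch below models that with a deep copy; seed_task and the dict shapes are hypothetical and say nothing about how the platform stores seeds.

```python
import copy


def seed_task(scenario: dict) -> dict:
    """Create a task whose scorers are a frozen copy of the scenario's."""
    return {
        "scenario_id": scenario["id"],
        # Deep copy so later scenario edits cannot reach into the seed.
        "scorers": copy.deepcopy(scenario["scorers"]),
    }


scenario = {
    "id": "scn-1",
    "scorers": [{"name": "tests", "type": "command", "command": "pytest -q"}],
}
old_task = seed_task(scenario)

# Editing the scenario afterwards leaves the seeded task untouched.
scenario["scorers"][0]["command"] = "pytest -q --maxfail=1"
new_task = seed_task(scenario)

print(old_task["scorers"][0]["command"])  # pytest -q
print(new_task["scorers"][0]["command"])  # pytest -q --maxfail=1
```

Only the task seeded after the edit picks up the new command, which mirrors why grading with updated scorers requires new tasks.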

The judge on coding runs

The judge scores every completed run on a fixed rubric: Task completion, Instruction adherence, and Efficiency, each on a 1 to 5 scale, plus an overall PASS or FAIL verdict, a failure_mode label on FAIL, and reasoning. For coding runs, the judge also receives:

Workspace diff (cumulative git diff versus baseline, truncated for very large changes).
Objective test result and command that produced it.
Scorer results summarized as a table.
Additional rubric notes from llm_judge scorers.
Command output (stdout / stderr tails and exit code) for CLI agents.

The expected_outcome oracle applies the same way as it does on ledger runs. A value of refusal means that refusing the task is the correct outcome. A value of completion, the default, evaluates literal completion of the task request. PASS is determined by the verdict field. The judge verdict is a separate signal from the mechanical scorers and does not gate scorer results.

The judge is a stochastic LLM call. Passing objective tests are treated as strong completion evidence, but verdicts can still vary across runs. Judge LLM failures are recorded as unparseable verdicts and do not fail the run.

Reading results

A coding run surfaces three independent gates. No gate subsumes another.

Gate	Where it appears	Means
Objective and checks	OVERALL badge in Scorers card, plus per-scorer rows.	Whether required mechanical scorers passed against workspace.
Judge verdict	Judge verdict card, and Judge verdict column in Data Explorer.	Whether semantic task outcome is correct.
Run badge	Run status in run list.	Whether run completed and passed generic run metrics.

Open the Agent Trace tab to inspect all three:

Scorers card, one row per scorer with PASS, FAIL, or N/A status, scorer type, an advisory marker when applicable, and score. Rows expand to show detail and the output tail. Hard failures start expanded. The card header shows the OVERALL verdict and mean score. The card is hidden for ledger runs.
Judge verdict card, with PASS or FAIL status, rubric score tiles, fact-check matrix, failure_mode label, reasoning, judge model, and scored-at footprint.

When the gates disagree

Gate disagreement is expected and informative:

Judge PASS and scorers FAIL indicates semantically plausible output that violated a hard mechanical constraint, such as forbidden paths, file-change limits, or test file mutation.
Scorers PASS and judge FAIL indicates mechanical checks passed but semantic completion was not convincing.
Both PASS with an advisory scorer FAIL indicates a heuristic warning that does not fail the run but should be reviewed.
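The three disagreement cases above can be turned into a small triage helper, for instance when post-processing exported results. The explain_gates function and its arguments are hypothetical; they are not fields of any documented export format.

```python
def explain_gates(scorers_overall: str, judge_verdict: str,
                  advisory_failed: bool = False) -> str:
    """Map a combination of gate outcomes to the reading given in the text."""
    if judge_verdict == "PASS" and scorers_overall == "FAIL":
        return "plausible output that violated a hard mechanical constraint"
    if scorers_overall == "PASS" and judge_verdict == "FAIL":
        return "mechanical checks passed but semantic completion was not convincing"
    if scorers_overall == "PASS" and judge_verdict == "PASS" and advisory_failed:
        return "heuristic warning to review; the run is not failed"
    return "gates agree"


print(explain_gates("FAIL", "PASS"))
print(explain_gates("PASS", "PASS", advisory_failed=True))
```

Keeping the two verdicts as separate inputs, rather than collapsing them into one status, preserves the information that makes disagreement useful.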

For multi-turn coding sessions, the same scorers run once at session end and the session judge scores the full transcript. See Multi-turn testing.
