Evaluations Overview
Layer custom scoring dimensions on top of agent outputs with human ratings, LLM judges, and programmatic checks.
The built-in LLM judge already renders the primary pass/fail verdict on each agent run (see Inspecting runs). Evaluations add your own scoring dimensions to those agent outputs and to any field output. A score can come from a human reviewing an agent trace, an LLM acting as a judge, or an automated programmatic check (regex, keyword, JSON validity, etc.).
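To make the programmatic category concrete, here is a minimal sketch of the three kinds of check named above (JSON validity, keyword, regex). The function names and signatures are hypothetical and only illustrate the idea; they are not the product's actual configuration format or API.

```python
import json
import re
from typing import List


def is_valid_json(output: str) -> bool:
    """Return True if the agent output parses as JSON."""
    try:
        json.loads(output)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError subclass
        return False
    return True


def contains_keywords(output: str, keywords: List[str]) -> bool:
    """Return True if every keyword appears in the output (case-insensitive)."""
    lowered = output.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def matches_pattern(output: str, pattern: str) -> bool:
    """Return True if the regular expression matches anywhere in the output."""
    return re.search(pattern, output) is not None
```

Each check takes the raw output string and returns a boolean, which is the shape a pass/fail programmatic criterion would need.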
The building blocks
- Criterion: a single, reusable scoring metric (e.g. "Factual accuracy, 1–5" or "Response is valid JSON"). Criteria live in the Evaluations → Criteria library and are the atomic unit: every evaluation result in the system is tied to exactly one criterion at a specific version.
- Evaluator: a criterion that has been attached to a specific node so that it scores that node's agent output. This is what actually produces a result for a task. Evaluators are configured in one of two places:
  - Inline, via the Form Builder: surfaced as a form field that a human reviewer sees and acts on.
  - Hidden, via the Evaluators panel: run server-side and invisible to the reviewer.
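The relationship between these building blocks can be sketched as a small data model. Every class and field name below is an assumption made for illustration; the product's real schema is not described in this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """A reusable scoring metric at one specific version (hypothetical shape)."""
    criterion_id: str
    label: str
    version: int


@dataclass(frozen=True)
class Evaluator:
    """A criterion attached to a node; placement is "inline" or "hidden"."""
    node_id: str
    criterion: Criterion
    placement: str


@dataclass(frozen=True)
class EvaluationResult:
    """A score for one task, tied to exactly one criterion version."""
    task_id: str
    criterion_id: str
    criterion_version: int
    score: object


def record_result(evaluator: Evaluator, task_id: str, score: object) -> EvaluationResult:
    # The result copies the criterion id and version from the evaluator,
    # so it stays tied to the version that produced it.
    return EvaluationResult(
        task_id=task_id,
        criterion_id=evaluator.criterion.criterion_id,
        criterion_version=evaluator.criterion.version,
        score=score,
    )
```

For example, an evaluator built from `Criterion("factual-accuracy", "Factual accuracy, 1-5", 3)` would yield results whose `criterion_version` is `3`.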
Inline vs. Hidden evaluators
Both kinds produce evaluation results attached to tasks; the difference is who sees the evaluator while reviewing an agent's output.
| | Inline evaluator | Hidden evaluator |
|---|---|---|
| Where configured | Form Builder (as a field) | Evaluators panel |
| Visible to a human reviewer | Yes (part of the form) | No (runs in the background) |
| Supported criterion types | Human Rating, LLM Judge, Programmatic | LLM Judge, Programmatic only (not Human Rating) |
| When to use | You want a reviewer to score it themselves or see the score, or the agent output under evaluation is also something the reviewer should rate or re-run | You want automated scoring that the reviewer should not see or be influenced by |
You can mix both on the same node — for example, a reviewer fills in an inline human rating while a hidden LLM judge scores the same agent response in the background.
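The "Supported criterion types" row of the table amounts to a small compatibility check. The sketch below encodes it with hypothetical enum and function names; it mirrors the table only and is not the product's validation code.

```python
from enum import Enum


class CriterionType(Enum):
    HUMAN_RATING = "human_rating"
    LLM_JUDGE = "llm_judge"
    PROGRAMMATIC = "programmatic"


class Placement(Enum):
    INLINE = "inline"
    HIDDEN = "hidden"


# Mirrors the table: hidden evaluators cannot use Human Rating criteria.
SUPPORTED_TYPES = {
    Placement.INLINE: {
        CriterionType.HUMAN_RATING,
        CriterionType.LLM_JUDGE,
        CriterionType.PROGRAMMATIC,
    },
    Placement.HIDDEN: {
        CriterionType.LLM_JUDGE,
        CriterionType.PROGRAMMATIC,
    },
}


def can_attach(placement: Placement, criterion_type: CriterionType) -> bool:
    """Return True if this criterion type is allowed for this placement."""
    return criterion_type in SUPPORTED_TYPES[placement]
```

Under this sketch, `can_attach(Placement.HIDDEN, CriterionType.HUMAN_RATING)` is `False`, while the same criterion type is accepted for `Placement.INLINE`.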
Where things live in the UI
| Thing | How to get there |
|---|---|
| Criteria library | Sidebar → Evaluations → Criteria tab |
| Attach an inline evaluator | Pipeline Builder → select a node → open the Form Builder, add a field (see Running evaluations) |
| Attach a hidden evaluator | Pipeline Builder → select a node → Evaluators panel → Add |
| Trigger a manual evaluator on existing tasks | Pipeline → Data Explorer → select rows → Evaluate button |
| View results per task | Data Explorer table, or click View on a row to open the Task Detail panel |
| View aggregate analytics | Pipeline → Data Explorer → Evaluation Analytics tab |
End-to-end flow
- Create the criteria you need in the Criteria tab, or reuse existing ones.
- Open the Pipeline Builder, select the node whose agent output you want to score, and attach evaluators, either inline via the Form Builder or hidden via the Evaluators panel. See Running evaluations for the three ways to make a field evaluative.
- For hidden evaluators only, choose a Trigger:
  - On Submit: runs automatically when the node is submitted.
  - Manual: only runs when triggered from the Data Explorer.

  Inline evaluative fields (Criteria fields and toggled-evaluative form fields) do not have a trigger selector; they always execute inline as part of the node's lifecycle.
- Run the agent and submit tasks. Inline and On-Submit evaluators produce results as the node is submitted (inline fields populate immediately; hidden On-Submit evaluators queue as a background job).
- For Manual hidden evaluators, open the Data Explorer, select tasks, click Evaluate, and confirm in the dialog.
- View results in the Data Explorer table (per-task) and the Evaluation Analytics tab (aggregate charts and breakdowns).
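The trigger rules in the flow above can be summarized as a single decision function. This is a sketch under stated assumptions: the string values (`"on_submit"`, `"manual"`, `"node_submitted"`, `"evaluate_clicked"`) and the function name are hypothetical, and it assumes the Evaluate button runs only Manual evaluators.

```python
from typing import Optional


def should_run(placement: str, trigger: Optional[str], event: str) -> bool:
    """Decide whether an evaluator runs for a given event.

    placement: "inline" or "hidden"
    trigger:   "on_submit" or "manual" for hidden evaluators, None for inline
    event:     "node_submitted" or "evaluate_clicked"
    """
    if placement == "inline":
        # Inline evaluative fields have no trigger selector; they run with the node.
        return event == "node_submitted"
    if trigger == "on_submit":
        return event == "node_submitted"
    if trigger == "manual":
        # Assumption: the Evaluate button runs only Manual evaluators.
        return event == "evaluate_clicked"
    return False
```

A hidden evaluator with the Manual trigger therefore does nothing when the node is submitted and runs only once someone clicks Evaluate in the Data Explorer.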
Versioning
Criteria are versioned. A new criterion version is created when its display label, config, or output schema changes on save; name- or description-only edits do not bump the version. Pipelines that reference the criterion do not auto-update — they keep running against the version they were pinned to until the pipeline is edited.
Existing evaluation results stay attached to the version that produced them, so historical data is never silently rewritten.
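The version-bump rule can be illustrated with a short sketch. The field names (`display_label`, `config`, `output_schema`) and the integer increment are assumptions chosen to match the wording above; the actual storage format and version numbering scheme may differ.

```python
# Fields whose change creates a new criterion version (per the rule above).
VERSIONED_FIELDS = ("display_label", "config", "output_schema")


def next_version(current_version: int, before: dict, after: dict) -> int:
    """Return the version a criterion would have after saving `after`.

    Edits that touch only other fields, such as name or description,
    leave the version unchanged.
    """
    changed = any(before.get(field) != after.get(field) for field in VERSIONED_FIELDS)
    return current_version + 1 if changed else current_version
```

For instance, changing only `description` keeps the version the same, while changing a key inside `config` moves it from 1 to 2 in this sketch. Pipelines pinned to version 1 would keep using it until they are edited.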