Evaluations Overview

The built-in LLM judge already renders the primary pass/fail verdict on each agent run (see Inspecting runs). Evaluations let you layer additional scoring dimensions of your own on top of those agent outputs, as well as on any field output. Scores can come from a human reviewing an agent trace, an LLM acting as a judge, or an automated programmatic check (regex, keyword, JSON validity, etc.).
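To make the programmatic option concrete, here is a minimal sketch of the three kinds of automated check named above. The function names are hypothetical and are not part of the product; the sketch only illustrates what such a check might compute over an agent's text output.

```python
import json
import re


def is_valid_json(output: str) -> bool:
    """Return True if the output parses as JSON (hypothetical helper)."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def contains_keyword(output: str, keyword: str) -> bool:
    """Case-insensitive keyword check (hypothetical helper)."""
    return keyword.lower() in output.lower()


def matches_pattern(output: str, pattern: str) -> bool:
    """Return True if the regex matches anywhere in the output."""
    return re.search(pattern, output) is not None
```

Each helper returns a boolean, which maps naturally onto a pass/fail style criterion such as Response is valid JSON.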

The building blocks

Criterion: a single, reusable scoring metric (e.g. "Factual accuracy, 1–5" or "Response is valid JSON"). Criteria live in the Evaluations → Criteria library and are the atomic unit: every evaluation result in the system is tied to exactly one criterion at a specific version.
Evaluator: a criterion that has been attached to a specific node so that it scores that node's agent output. This is what actually produces a result for a task. Evaluators are configured in one of two places:
- Inline, via the Form Builder: surfaced as a form field that a human reviewer sees and acts on.
- Hidden, via the Evaluators panel: run server-side, invisible to the reviewer.
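The relationship between these building blocks can be sketched as a small data model. All class and field names below are assumptions made for illustration; they are not the product's actual schema. The point is only that a result records one criterion at one specific version.

```python
from dataclasses import dataclass
from enum import Enum


class CriterionType(Enum):
    HUMAN_RATING = "human_rating"
    LLM_JUDGE = "llm_judge"
    PROGRAMMATIC = "programmatic"


class Placement(Enum):
    INLINE = "inline"  # configured in the Form Builder, visible to the reviewer
    HIDDEN = "hidden"  # configured in the Evaluators panel, runs server-side


@dataclass(frozen=True)
class Criterion:
    name: str
    type: CriterionType
    version: int


@dataclass(frozen=True)
class Evaluator:
    """A criterion attached to a specific node (hypothetical shape)."""
    criterion: Criterion
    node_id: str
    placement: Placement


@dataclass(frozen=True)
class EvaluationResult:
    """Tied to exactly one criterion at a specific version."""
    task_id: str
    criterion_name: str
    criterion_version: int
    score: object


def record_result(evaluator: Evaluator, task_id: str, score: object) -> EvaluationResult:
    # Copy the version at scoring time so later criterion edits
    # cannot change what this result refers to.
    c = evaluator.criterion
    return EvaluationResult(task_id, c.name, c.version, score)
```

For example, `record_result(Evaluator(Criterion("Factual accuracy", CriterionType.HUMAN_RATING, 3), "node-1", Placement.INLINE), "task-42", 4)` yields a result whose `criterion_version` is 3.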

Inline vs. Hidden evaluators

Both kinds produce evaluation results attached to tasks; the difference is who sees the evaluator while reviewing an agent's output.

	Inline evaluator	Hidden evaluator
Where configured	Form Builder (as a field)	Evaluators panel
Visible to a human reviewer	Yes (part of the form)	No (runs in the background)
Supported criterion types	Human Rating, LLM Judge, Programmatic	LLM Judge, Programmatic only (not Human Rating)
When to use	The reviewer should enter the score or see it, or the agent output being evaluated is something the reviewer should also rate or re-run	You want automated scoring that the reviewer should not see or be influenced by

You can mix both on the same node — for example, a reviewer fills in an inline human rating while a hidden LLM judge scores the same agent response in the background.
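The "Supported criterion types" row of the table amounts to a small compatibility rule. A minimal sketch, assuming hypothetical string identifiers for placements and criterion types:

```python
# Which criterion types each evaluator placement accepts, per the table above.
# The string identifiers are illustrative, not the product's own.
SUPPORTED_TYPES = {
    "inline": {"human_rating", "llm_judge", "programmatic"},
    "hidden": {"llm_judge", "programmatic"},
}


def can_attach(placement: str, criterion_type: str) -> bool:
    """Return True if a criterion of this type can be attached at this placement."""
    if placement not in SUPPORTED_TYPES:
        raise ValueError(f"unknown placement: {placement!r}")
    return criterion_type in SUPPORTED_TYPES[placement]
```

Under this rule, `can_attach("hidden", "human_rating")` is False, which matches the table: a Human Rating needs a reviewer, and hidden evaluators are never shown to one.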

Where things live in the UI

Thing	How to get there
Criteria library	Sidebar → Evaluations → Criteria tab
Attach an inline evaluator	Pipeline Builder → select a node → open the Form Builder, add a field (see Running evaluations)
Attach a hidden evaluator	Pipeline Builder → select a node → Evaluators panel → Add
Trigger a manual evaluator on existing tasks	Pipeline → Data Explorer → select rows → Evaluate button
View results per task	Data Explorer table, or click View on a row to open the Task Detail panel
View aggregate analytics	Pipeline → Data Explorer → Evaluation Analytics tab

End-to-end flow

1. Create the criteria you need in the Criteria tab, or reuse existing ones.
2. Open the Pipeline Builder, select the node whose agent output you want to score, and attach evaluators, either inline via the Form Builder or hidden via the Evaluators panel. See Running evaluations for the three ways to make a field evaluative.
3. For hidden evaluators only, choose a Trigger:
   - On Submit: runs automatically when the node is submitted.
   - Manual: only runs when triggered from the Data Explorer.
   Inline evaluative fields (Criteria fields and toggled-evaluative form fields) do not have a trigger selector; they always execute inline as part of the node's lifecycle.
4. Run the agent and submit tasks. Inline and On-Submit evaluators produce results as the node is submitted (inline fields populate immediately; hidden On-Submit evaluators queue as a background job).
5. For Manual hidden evaluators, open the Data Explorer, select tasks, click Evaluate, and confirm in the dialog.
6. View results in the Data Explorer table (per-task) and the Evaluation Analytics tab (aggregate charts and breakdowns).
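The trigger rules in this flow can be summarized as a short sketch. The `EvaluatorConfig` class, the event names, and the string values are all hypothetical; the logic simply restates which evaluators run on node submission and which wait for a manual Evaluate action.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EvaluatorConfig:
    name: str
    placement: str                 # "inline" or "hidden" (illustrative values)
    trigger: Optional[str] = None  # "on_submit" or "manual"; hidden evaluators only


def evaluators_for_event(evaluators: List[EvaluatorConfig], event: str) -> List[str]:
    """Return the names of evaluators that run for a "submit" or "manual" event."""
    if event == "submit":
        # Inline fields have no trigger selector and always run with the node;
        # hidden evaluators run here only when set to On Submit.
        return [
            e.name for e in evaluators
            if e.placement == "inline" or e.trigger == "on_submit"
        ]
    if event == "manual":
        # Only hidden evaluators with the Manual trigger respond to Evaluate.
        return [
            e.name for e in evaluators
            if e.placement == "hidden" and e.trigger == "manual"
        ]
    raise ValueError(f"unknown event: {event!r}")
```

The sketch ignores timing: in the flow described above, inline fields populate immediately while hidden On-Submit evaluators are queued as a background job.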

Versioning

Criteria are versioned. A new criterion version is created when its display label, config, or output schema changes on save; name- or description-only edits do not bump the version. Pipelines that reference the criterion do not auto-update — they keep running against the version they were pinned to until the pipeline is edited.

Existing evaluation results stay attached to the version that produced them, so historical data is never silently rewritten.
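The version-bump rule can be sketched as a comparison over the fields that matter. The field names and the assumption that versions are integers incremented by one are illustrative only; the source states which kinds of edits create a new version, not how versions are numbered.

```python
# Fields whose changes create a new criterion version on save.
# Names are hypothetical stand-ins for display label, config, and output schema.
VERSIONED_FIELDS = ("display_label", "config", "output_schema")


def next_version(current_version: int, saved: dict, edited: dict) -> int:
    """Return the version after saving `edited` over `saved`.

    Edits limited to other fields (such as name or description)
    leave the version unchanged.
    """
    changed = any(saved.get(f) != edited.get(f) for f in VERSIONED_FIELDS)
    return current_version + 1 if changed else current_version
```

For example, editing only `description` keeps version 2 at 2, while changing `config` moves it to 3. A pipeline pinned to version 2 would keep using version 2 until the pipeline itself is edited.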
