Experiments

AgentV eval files are the runnable authoring artifact. Use top-level description for display metadata, tags.experiment as the run/result grouping label, target for the system under test, and flat top-level run controls such as timeout_seconds and threshold. Use evaluate_options for evaluation runtime options such as repeat, budget_usd, and max_concurrency. For operator-side overrides, use agentv eval --workers N or a project config default such as execution.workers in agentv.config.* or .agentv/config.yaml.

    name: support-regression
    description: Support regression suite
    tags:
      experiment: support-codex
    target:
      extends: codex-gpt5
      model: gpt-5.1
      reasoning_effort: high
    timeout_seconds: 720
    evaluate_options:
      repeat:
        count: 4
        strategy: pass_any
      budget_usd: 2.00
      max_concurrency: 3
    workspace:
      hooks:
        before_all:
          command: ["bash", "-lc", "bun install && bun run build"]
    tests:
      - id: refund-eligibility
        input: Can this customer get a refund?
        criteria: Applies the refund policy correctly

Use directories for human organization, not schema behavior. A common layout is:

    evals/
      suites/
        refunds.eval.yaml
      cases/
        refund-smoke.cases.yaml
    experiments/
      refunds-codex.eval.yaml

In that layout, evals/suites/refunds.eval.yaml is a reusable task suite, evals/cases/refund-smoke.cases.yaml is raw case data, and experiments/refunds-codex.eval.yaml is a wrapper eval. The wrapper still runs only because it is eval YAML:

    # experiments/refunds-codex.eval.yaml
    name: refunds-codex
    target: codex-gpt5
    tests:
      - id: local-edge-case
        input: Check a damaged final-sale refund.
    imports:
      suites:
        - path: ../evals/suites/refunds.eval.yaml
      tests:
        - path: ../evals/cases/refund-smoke.cases.yaml

The experiments/ folder is optional and user-owned. AgentV does not scan it for special files or infer runtime behavior from the path; the same wrapper eval could live under evals/wrappers/, benchmarks/, or beside the suite it runs.

Use imports.suites for full child suites and imports.tests for raw test rows. Inline tests remain raw cases owned by the current file.

    imports:
      suites:
        - path: evals/support/*.eval.yaml
          select:
            test_ids:
              - refund-*
              - missing-order-date
            tags: regression
            metadata:
              priority: high
          run:
            threshold: 1.0
            timeout_seconds: 300
      tests:
        - path: cases/*.cases.yaml
        - path: cases/regression.jsonl
    tests:
      - cases/smoke/*.cases.yaml

imports.suites preserves the imported suite’s task contract: metadata, workspace, shared input, shared assertions, and tests. The parent eval still owns the single run bundle and run controls. Use the parent target and top-level run controls for the overall run, and an import-level run: block for scoped threshold, timeout, or budget overrides.

A parent eval that imports any imports.suites entry must not define top-level workspace. Imported suites own task environment. If the parent should provide workspace context, import raw cases with imports.tests or shorthand paths instead of importing an eval suite.

imports.tests imports only raw test entries. It intentionally drops shared context from an imported eval suite, so parent suite fields apply to those raw cases.

Import select.test_ids filters imported test IDs with glob patterns. Import select.tags filters each imported case’s effective metadata.tags. Effective case tags are suite-first and deduped: suite.tags + suite.metadata.tags + test.metadata.tags. Top-level suite tags still remain suite identity metadata for discovery and reporting; selection reads the merged case metadata view. Import select.metadata filters case metadata by key/value, where selector values may be scalars or lists. Globbed include paths are resolved in deterministic path order, then test order.
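
The selection rules above can be modelled in a few lines of Python. This is an illustrative sketch, not AgentV's implementation: the function names and the case dictionary shape are hypothetical, and it assumes that the three selectors combine with AND and that a list selector matches when any of its values matches.

```python
from fnmatch import fnmatchcase


def effective_tags(suite_tags, suite_meta_tags, test_meta_tags):
    """Merge tags suite-first and drop duplicates while keeping order."""
    merged = []
    for tag in [*suite_tags, *suite_meta_tags, *test_meta_tags]:
        if tag not in merged:
            merged.append(tag)
    return merged


def as_list(value):
    """Treat a scalar selector value as a one-element list."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def matches(case, select):
    """Return True when a case passes every selector that is present.

    Assumption: selectors combine with AND, and a list selector matches
    when any of its values matches.
    """
    patterns = as_list(select.get("test_ids", ["*"]))
    if not any(fnmatchcase(case["id"], pattern) for pattern in patterns):
        return False
    wanted_tags = as_list(select.get("tags", []))
    if wanted_tags and not set(wanted_tags) & set(case["tags"]):
        return False
    for key, wanted in select.get("metadata", {}).items():
        if case["metadata"].get(key) not in as_list(wanted):
            return False
    return True


case = {
    "id": "refund-partial",
    "tags": effective_tags(["support"], ["regression"], ["regression", "refunds"]),
    "metadata": {"priority": "high"},
}
select = {
    "test_ids": ["refund-*", "missing-order-date"],
    "tags": "regression",
    "metadata": {"priority": "high"},
}
print(case["tags"])            # ['support', 'regression', 'refunds']
print(matches(case, select))   # True
```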

String-valued tests and string entries inside tests[] are raw-case import shorthand. They are equivalent to imports.tests and may point at raw case files, directories, or globs. Importing another eval suite must use imports.suites.
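
As a rough model of the shorthand rule, the sketch below separates inline cases from raw-case import paths. The helper name split_tests and the returned shape are hypothetical; only the rule itself (strings behave like imports.tests paths) comes from the text above.

```python
def split_tests(tests):
    """Separate inline raw cases from raw-case import paths.

    A string-valued `tests`, or a string entry inside `tests[]`, is
    treated like an `imports.tests` path entry.
    """
    if isinstance(tests, str):
        return [], [{"path": tests}]
    inline, imports = [], []
    for entry in tests or []:
        if isinstance(entry, str):
            imports.append({"path": entry})
        else:
            inline.append(entry)
    return inline, imports


inline, imports = split_tests([
    {"id": "local-edge-case", "input": "Check a damaged final-sale refund."},
    "cases/smoke/*.cases.yaml",
])
print(len(inline), imports)  # 1 [{'path': 'cases/smoke/*.cases.yaml'}]
```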

Suite imports are resolved as a deterministic include graph. A circular chain of imports.suites entries fails validation, and the error reports the import chain. Raw-case shorthand does not recursively load suite runtime blocks.
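
One way to picture the cycle check is a depth-first walk that carries the current import chain. The sketch below is an assumption about how such validation could work, not a description of AgentV internals; the graph dictionary and file names are made up for illustration.

```python
def find_cycle(graph, start):
    """Walk suite imports depth-first and return the import chain of the
    first cycle found, or None when no cycle is reachable from `start`."""
    def visit(node, chain):
        if node in chain:
            return chain[chain.index(node):] + [node]
        for child in sorted(graph.get(node, [])):  # deterministic order
            cycle = visit(child, chain + [node])
            if cycle:
                return cycle
        return None

    return visit(start, [])


# Hypothetical include graph: a.eval.yaml and b.eval.yaml import each other.
graph = {
    "parent.eval.yaml": ["a.eval.yaml"],
    "a.eval.yaml": ["b.eval.yaml"],
    "b.eval.yaml": ["a.eval.yaml"],
}
print(find_cycle(graph, "parent.eval.yaml"))
# ['a.eval.yaml', 'b.eval.yaml', 'a.eval.yaml']
```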

Imported suite rows keep their source suite metadata in index.jsonl. Use each row’s result_dir as the authoritative path to generated artifacts inside the run directory; do not infer layout from suite names.

Use scoped run: blocks for result interpretation and scheduling policies that vary by include group or test case. Precedence is:

    test.run > import run > parent top-level run controls

    target: agent
    threshold: 0.8
    evaluate_options:
      repeat:
        count: 3
        strategy: pass_any
    imports:
      suites:
        - path: ./evals/flaky-agentic/**/*.eval.yaml
          select:
            tags: [agentic]
          run:
            timeout_seconds: 300
        - path: ./evals/regression/**/*.eval.yaml
          select:
            tags: [must-pass]
          run:
            threshold: 1.0
            timeout_seconds: 300
    tests:
      - id: critical-case
        input: "..."
        criteria: Must pass exactly
        run:
          threshold: 1.0
          budget_usd: 0.50
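
The precedence rule can be read as layering dictionaries from lowest to highest priority. This is a minimal sketch under that assumption; resolve_run is a hypothetical helper and the values are illustrative.

```python
def resolve_run(parent, import_run=None, test_run=None):
    """Layer run controls so that test.run beats import run, which beats
    the parent's top-level run controls."""
    resolved = dict(parent)
    for layer in (import_run, test_run):
        resolved.update(layer or {})
    return resolved


parent = {"threshold": 0.8, "timeout_seconds": 720}
print(resolve_run(parent, import_run={"timeout_seconds": 300}))
# {'threshold': 0.8, 'timeout_seconds': 300}
print(resolve_run(parent, {"timeout_seconds": 300}, {"threshold": 1.0}))
# {'threshold': 1.0, 'timeout_seconds': 300}
```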

Scoped run: supports threshold, repeat, timeout_seconds, and legacy per-case budget_usd overrides. Parent suite budgets should use evaluate_options.budget_usd for public eval authoring. Use evaluate_options.max_concurrency for authored concurrency. Candidate-changing fields stay parent-level. Executable workspace setup belongs in top-level lifecycle extensions, and provider-specific setup belongs in target configuration.

Run controls do not own commands that prepare files, dependencies, repos, or target-specific runner state.

| Need | Put it in |
| --- | --- |
| Install dependencies, build the repo, seed files | extensions: ["file://scripts/setup.mjs:beforeAll"] |
| Apply per-case state | extensions: ["file://scripts/setup.mjs:beforeEach"] |
| Reset file state after each case | workspace.hooks.after_each.reset |
| Configure an agent runner or provider variant | target object or targets.yaml |
| Choose the target | top-level target |
| Override the target’s default model | target.model |
| Configure repeat policy, budget, concurrency, timeout, threshold | evaluate_options.repeat, evaluate_options.budget_usd, evaluate_options.max_concurrency, timeout_seconds, threshold |
| Bind an existing local workspace directory | --workspace-path or .agentv/config.local.yaml |

    extensions:
      - file://scripts/build.mjs:beforeAll
    target:
      extends: codex-gpt5
      hooks:
        before_each:
          command: ["sh", "-c", "cp -R skills \"{{workspace_path}}/.codex/skills\""]
    evaluate_options:
      repeat:
        count: 3
        strategy: pass_any

Existing local workspace paths are machine-local bindings: pass --workspace-path for a one-off run or put execution.workspace_path in .agentv/config.local.yaml. Put repos, templates, hooks, Docker config, env checks, and isolation under top-level or case-level workspace.

Use evaluate_options.repeat when you want AgentV to try each case more than once:

    evaluate_options:
      repeat: 3

Use object form when you need richer AgentV behavior:

    evaluate_options:
      repeat:
        count: 3
        strategy: pass_any
        early_exit: true
        cost_limit_usd: 1.00

evaluate_options.repeat.strategy controls verdict aggregation. pass_any treats the case as successful when any completed attempt passes; pass_all requires every completed attempt to pass. mean and confidence_interval aggregate scores where supported today. evaluate_options.repeat.early_exit is only a scheduling and cost optimization: pass_any may stop at the first pass, and pass_all may stop at the first fail. Leave it unset or false when you want complete variance data. Per-case tests[].options.repeat overrides the global repeat count or object for that case.
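
To make the interaction between strategy and early_exit concrete, here is a small Python model. It is only a sketch: attempt is a hypothetical callable returning (passed, score), confidence_interval is left out, and the real scheduler may behave differently.

```python
from statistics import mean


def run_attempts(attempt, count, strategy, early_exit=False):
    """Run up to `count` attempts and aggregate them.

    Returns (outcome, attempts_run). The outcome is a bool for pass_any
    and pass_all, and the mean score for mean.
    """
    results = []
    for _ in range(count):
        passed, score = attempt()
        results.append((passed, score))
        if early_exit and strategy == "pass_any" and passed:
            break  # the first pass already decides the verdict
        if early_exit and strategy == "pass_all" and not passed:
            break  # the first fail already decides the verdict
    passes = [passed for passed, _ in results]
    if strategy == "pass_any":
        outcome = any(passes)
    elif strategy == "pass_all":
        outcome = all(passes)
    elif strategy == "mean":
        outcome = mean([score for _, score in results])
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return outcome, len(results)


# Hypothetical attempt outcomes: fail, pass, pass.
outcomes = iter([(False, 0.2), (True, 0.9), (True, 1.0)])
print(run_attempts(lambda: next(outcomes), 3, "pass_any", early_exit=True))
# (True, 2)
```

With early_exit left off, all three attempts run, which is why the text recommends leaving it unset when complete variance data matters.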

Eval runs write to a direct run bundle:

.agentv/results/<run_id>/

CLI --experiment sets the experiment label explicitly. Without that flag, AgentV uses the reserved tags.experiment key (see below), then the suite name, then the eval filename. The precedence is --experiment > tags.experiment > default. There is no top-level experiment field; a run is labeled with tags.experiment. The Dashboard uses “Experiment” for the comparison and result grouping concept. Folder names are only storage allocation and must not define result semantics.
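
The fallback chain for the experiment label can be sketched as a first-non-empty lookup. The function below is hypothetical, and it assumes the "eval filename" fallback means the file's base name including its extension, which the text does not specify.

```python
from pathlib import Path


def experiment_label(cli_experiment, tags, suite_name, eval_path):
    """Pick the first available label, highest precedence first."""
    tag_label = tags.get("experiment") if isinstance(tags, dict) else None
    candidates = [
        cli_experiment,        # --experiment
        tag_label,             # tags.experiment (map form only)
        suite_name,            # default: suite name
        Path(eval_path).name,  # default: eval filename (assumed form)
    ]
    return next(label for label in candidates if label)


print(experiment_label(None, {"experiment": "support-codex"},
                       "support-regression", "evals/support.eval.yaml"))
# support-codex
```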

Suite-level tags accepts either the existing selection form (a string or list of strings that drives select.tags / --tag name filtering) or a promptfoo-shaped map:

    tags:
      experiment: baseline-v2
      team: compliance

The map form is run metadata, not selection. The reserved experiment key feeds the experiment namespace, and the full map is emitted to summary.json.metadata.tags and every index.jsonl row so the Dashboard can group trend/compare views by tags.experiment.

Set or override map tags from the CLI with a repeatable --tag key=value flag (--tag experiment=baseline-v2 --tag team=compliance); bare --tag name keeps its existing file-selection meaning. Tags merge with precedence CLI --tag key=value > project config tags > eval tags. --experiment still wins over tags.experiment for the namespace, and an explicit --tag experiment= clears the label back to the default.
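
The two meanings of --tag and the merge order can be illustrated as follows. Both helpers are hypothetical, and treating an empty value as "remove the key" is an assumption about how --tag experiment= clears the label, not documented behavior.

```python
def parse_tag_flags(flags):
    """Split repeatable --tag values into selection names and map tags."""
    names, mapping = [], {}
    for flag in flags:
        if "=" in flag:
            key, value = flag.split("=", 1)
            mapping[key] = value
        else:
            names.append(flag)  # bare name keeps its selection meaning
    return names, mapping


def merge_tags(eval_tags, project_tags, cli_tags):
    """Later layers win: eval tags < project config tags < CLI tags.

    Assumption: an empty value removes the key, so the label falls back
    to the default.
    """
    merged = {**eval_tags, **project_tags, **cli_tags}
    return {key: value for key, value in merged.items() if value != ""}


names, cli_tags = parse_tag_flags(["regression", "experiment=", "team=compliance"])
print(names)  # ['regression']
print(merge_tags({"experiment": "baseline-v2", "team": "support"},
                 {"team": "platform"}, cli_tags))
# {'team': 'compliance'}
```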

Imported source suite metadata appears in index.jsonl rows and manifests. Use index.jsonl fields such as eval_path, test_id, target, and result_dir for identity and artifact discovery instead of reconstructing paths from suite names or wrapper layout.
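
A short script along these lines could read rows for artifact discovery. It is a sketch that uses only the field names mentioned above; the sample row and its result_dir value are invented placeholders, and real rows carry more fields.

```python
import json


def load_index(text):
    """Parse index.jsonl content into one dict per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def result_dirs_for(rows, test_id):
    """Collect result_dir values for a test ID straight from the rows,
    without reconstructing paths from suite names or layout."""
    return [row["result_dir"] for row in rows if row["test_id"] == test_id]


# Hypothetical row for illustration only.
sample = json.dumps({
    "eval_path": "evals/suites/refunds.eval.yaml",
    "test_id": "refund-eligibility",
    "target": "codex-gpt5",
    "result_dir": "placeholder-result-dir",
}) + "\n"

rows = load_index(sample)
print(result_dirs_for(rows, "refund-eligibility"))  # ['placeholder-result-dir']
```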

For the complete result file contract, including why row metadata is semantic truth and directories are storage allocation, see Result Artifact Contract.