Experiments

AgentV eval files are the runnable authoring artifact. Use top-level description for display metadata, tags.experiment as the run/result grouping label, target for the system under test, and flat top-level run controls such as timeout_seconds and threshold. Use evaluate_options for evaluation runtime options such as repeat, budget_usd, and max_concurrency. For operator-side overrides, use agentv eval --workers N or a project config default such as execution.workers in agentv.config.* or .agentv/config.yaml.

name: support-regression
description: Support regression suite
tags:
  experiment: support-codex
target:
  extends: codex-gpt5
  model: gpt-5.1
  reasoning_effort: high
timeout_seconds: 720
evaluate_options:
  repeat:
    count: 4
    strategy: pass_any
  budget_usd: 2.00
  max_concurrency: 3

workspace:
  hooks:
    before_all:
      command: ["bash", "-lc", "bun install && bun run build"]

tests:
  - id: refund-eligibility
    input: Can this customer get a refund?
    criteria: Applies the refund policy correctly

Layout Conventions

Use directories for human organization, not schema behavior. A common layout is:

evals/
  suites/
    refunds.eval.yaml
  cases/
    refund-smoke.cases.yaml
experiments/
  refunds-codex.eval.yaml

In that layout, evals/suites/refunds.eval.yaml is a reusable task suite, evals/cases/refund-smoke.cases.yaml is raw case data, and experiments/refunds-codex.eval.yaml is a wrapper eval. The wrapper runs because it is eval YAML, not because of where it lives:

name: refunds-codex
target: codex-gpt5

tests:
  - id: local-edge-case
    input: Check a damaged final-sale refund.

imports:
  suites:
    - path: ../evals/suites/refunds.eval.yaml
  tests:
    - path: ../evals/cases/refund-smoke.cases.yaml

The experiments/ folder is optional and user-owned. AgentV does not scan it for special files or infer runtime behavior from the path; the same wrapper eval could live under evals/wrappers/, benchmarks/, or beside the suite it runs.

Suite And Test Imports

Use imports.suites for full child suites and imports.tests for raw test rows. Inline tests remain raw cases owned by the current file.

imports:
  suites:
    - path: evals/support/*.eval.yaml
      select:
        test_ids:
          - refund-*
          - missing-order-date
        tags: regression
        metadata:
          priority: high
      run:
        threshold: 1.0
        timeout_seconds: 300
  tests:
    - path: cases/*.cases.yaml
    - path: cases/regression.jsonl

tests:
  - cases/smoke/*.cases.yaml

imports.suites preserves the imported suite’s task contract: metadata, workspace, shared input, shared assertions, and tests. The parent eval still owns the single run bundle and run controls. Use parent target and top-level run controls for the overall run, and import run: for scoped threshold, timeout, or budget overrides.

A parent eval that has any imports.suites entry must not define a top-level workspace, because imported suites own the task environment. If the parent should provide workspace context, import raw cases with imports.tests or shorthand paths instead of importing an eval suite.

imports.tests imports only raw test entries. It intentionally drops shared context from an imported eval suite, so parent suite fields apply to those raw cases.

Import select.test_ids filters imported test IDs with glob patterns. Import select.tags filters each imported case’s effective metadata.tags. Effective case tags are suite-first and deduped: suite.tags + suite.metadata.tags + test.metadata.tags. Top-level suite tags still remain suite identity metadata for discovery and reporting; selection reads the merged case metadata view. Import select.metadata filters case metadata by key/value, where selector values may be scalars or lists. Globbed include paths are resolved in deterministic path order, then test order.
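The tag merge and ID filter described above can be sketched in a few lines. This is an illustrative model, not AgentV's implementation: the helper names are hypothetical, and Python's `fnmatch` glob rules are assumed to be close enough to AgentV's pattern matching for simple patterns such as `refund-*`.

```python
from fnmatch import fnmatchcase


def effective_tags(suite_tags, suite_metadata_tags, test_metadata_tags):
    """Merge tags suite-first and drop duplicates, keeping first-seen order."""
    merged = list(suite_tags) + list(suite_metadata_tags) + list(test_metadata_tags)
    # dict.fromkeys preserves insertion order, so the first occurrence wins.
    return list(dict.fromkeys(merged))


def matches_test_id(test_id, patterns):
    """True when the test ID matches at least one glob pattern (assumed semantics)."""
    return any(fnmatchcase(test_id, pattern) for pattern in patterns)


tags = effective_tags(["regression"], ["support", "regression"], ["refunds", "support"])
selected = [
    test_id
    for test_id in ["refund-eligibility", "missing-order-date", "chargeback"]
    if matches_test_id(test_id, ["refund-*", "missing-order-date"])
]
```

How select.test_ids, select.tags, and select.metadata combine when more than one is present is not modeled here.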

String-valued tests and string entries inside tests[] are raw-case import shorthand. They are equivalent to imports.tests and may point at raw case files, directories, or globs. Importing another eval suite must use imports.suites.

Suite imports are resolved as a deterministic include graph. A circular chain of imports.suites entries fails validation, and the error reports the import chain. Raw-case shorthand does not recursively load suite runtime blocks.
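One way to picture the circular-import check is a depth-first walk that remembers the current import chain. The sketch below is a generic cycle detector under assumed inputs (a hypothetical `graph` dict mapping each eval file to the suites it imports); it does not claim to match AgentV's validator or its error format.

```python
def find_import_cycle(graph, root):
    """Return the import chain that closes a cycle, or None if none is reachable from root."""
    chain = []       # files on the current depth-first path, in import order
    on_path = set()  # same files as a set, for fast membership checks
    done = set()     # files already fully explored without finding a cycle

    def visit(node):
        if node in on_path:
            # Slice from the first occurrence and repeat the node to show the loop.
            return chain[chain.index(node):] + [node]
        if node in done:
            return None
        on_path.add(node)
        chain.append(node)
        for child in graph.get(node, []):
            cycle = visit(child)
            if cycle:
                return cycle
        chain.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return visit(root)


# Hypothetical file names, for illustration only.
cyclic = {"a.eval.yaml": ["b.eval.yaml"], "b.eval.yaml": ["c.eval.yaml"], "c.eval.yaml": ["a.eval.yaml"]}
acyclic = {"a.eval.yaml": ["b.eval.yaml", "c.eval.yaml"], "b.eval.yaml": ["c.eval.yaml"], "c.eval.yaml": []}
```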

Imported suite rows keep their source suite metadata in index.jsonl. Use each row’s result_dir as the authoritative path to generated artifacts inside the run directory; do not infer layout from suite names.

Scoped Run Overrides

Use scoped run: blocks for result interpretation and scheduling policies that vary by include group or test case. Precedence is:

test.run > import run > parent top-level run controls

target: agent
threshold: 0.8
evaluate_options:
  repeat:
    count: 3
    strategy: pass_any

imports:
  suites:
    - path: ./evals/flaky-agentic/**/*.eval.yaml
      select:
        tags: [agentic]
      run:
        timeout_seconds: 300

    - path: ./evals/regression/**/*.eval.yaml
      select:
        tags: [must-pass]
      run:
        threshold: 1.0
        timeout_seconds: 300

tests:
  - id: critical-case
    input: "..."
    criteria: Must pass exactly
    run:
      threshold: 1.0
      budget_usd: 0.50

Scoped run: supports threshold, repeat, timeout_seconds, and legacy per-case budget_usd overrides. Parent suite budgets should use evaluate_options.budget_usd for public eval authoring. Use evaluate_options.max_concurrency for authored concurrency. Candidate-changing fields stay parent-level. Executable workspace setup belongs in top-level lifecycle extensions, and provider-specific setup belongs in target configuration.
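The precedence rule test.run > import run > parent top-level run controls can be modeled as layered dictionaries. This is a minimal sketch with a hypothetical `resolve_run` helper. It assumes a shallow, per-key override; whether AgentV merges nested objects such as repeat field by field is not stated here.

```python
def resolve_run(parent, import_run=None, test_run=None):
    """Layer scoped run blocks so that test.run beats import run, which beats the parent."""
    resolved = dict(parent)
    for layer in (import_run, test_run):  # later layers win
        if layer:
            resolved.update(layer)
    return resolved


parent_controls = {"threshold": 0.8, "timeout_seconds": 720}
import_override = {"threshold": 1.0, "timeout_seconds": 300}
test_override = {"threshold": 0.9}

effective = resolve_run(parent_controls, import_override, test_override)
```

In this example the test keeps the import's timeout_seconds of 300 but uses its own threshold of 0.9.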

Lifecycle Ownership

Run controls do not own commands that prepare files, dependencies, repos, or target-specific runner state.

| Need | Put it in |
| --- | --- |
| Install dependencies, build the repo, seed files | `extensions: ["file://scripts/setup.mjs:beforeAll"]` |
| Apply per-case state | `extensions: ["file://scripts/setup.mjs:beforeEach"]` |
| Reset file state after each case | `workspace.hooks.after_each.reset` |
| Configure an agent runner or provider variant | `target` object or `targets.yaml` |
| Choose the target | top-level `target` |
| Override the target’s default model | `target.model` |
| Configure repeat policy, budget, concurrency, timeout, threshold | `evaluate_options.repeat`, `evaluate_options.budget_usd`, `evaluate_options.max_concurrency`, `timeout_seconds`, `threshold` |
| Bind an existing local workspace directory | `--workspace-path` or `.agentv/config.local.yaml` |

extensions:
  - file://scripts/build.mjs:beforeAll

target:
  extends: codex-gpt5
  hooks:
    before_each:
      command: ["sh", "-c", "cp -R skills \"{{workspace_path}}/.codex/skills\""]
evaluate_options:
  repeat:
    count: 3
    strategy: pass_any

Existing local workspace paths are machine-local bindings: pass --workspace-path for a one-off run or put execution.workspace_path in .agentv/config.local.yaml. Put repos, templates, hooks, Docker config, env checks, and isolation under top-level or case-level workspace.

Repeat Runs

Use evaluate_options.repeat when you want AgentV to try each case more than once:

evaluate_options:
  repeat: 3

Use object form when you need richer AgentV behavior:

evaluate_options:
  repeat:
    count: 3
    strategy: pass_any
    early_exit: true
    cost_limit_usd: 1.00

evaluate_options.repeat.strategy controls verdict aggregation. pass_any treats the case as successful when any completed attempt passes; pass_all requires every completed attempt to pass. mean and confidence_interval aggregate scores where supported today. evaluate_options.repeat.early_exit is only a scheduling and cost optimization: pass_any may stop at the first pass, and pass_all may stop at the first fail. Leave it unset or false when you want complete variance data. Per-case tests[].options.repeat overrides the global repeat count or object for that case.
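The interaction between strategy and early_exit can be sketched as a small loop. This is an illustrative model only: `run_with_repeat` and `attempt_fn` are hypothetical names, attempts are reduced to pass/fail booleans, and the score-based mean and confidence_interval strategies are left out.

```python
def aggregate(attempts, strategy):
    """Combine completed pass/fail attempts into one verdict."""
    if strategy == "pass_any":
        return any(attempts)
    if strategy == "pass_all":
        return all(attempts)
    raise ValueError(f"strategy not modeled in this sketch: {strategy}")


def run_with_repeat(attempt_fn, count, strategy, early_exit=False):
    """Run up to `count` attempts and stop early only when the verdict is already decided."""
    results = []
    for index in range(count):
        passed = attempt_fn(index)
        results.append(passed)
        decided = (strategy == "pass_any" and passed) or (strategy == "pass_all" and not passed)
        if early_exit and decided:
            break
    return aggregate(results, strategy), results


outcomes = [False, True, True]  # made-up attempt results for illustration
any_early = run_with_repeat(lambda i: outcomes[i], 3, "pass_any", early_exit=True)
all_early = run_with_repeat(lambda i: outcomes[i], 3, "pass_all", early_exit=True)
any_full = run_with_repeat(lambda i: outcomes[i], 3, "pass_any")
```

The verdict is the same with or without early_exit; only the number of recorded attempts changes, which is why complete variance data requires leaving it off.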

Result Layout

Eval runs write to a direct run bundle:

.agentv/results/<run_id>/

CLI --experiment sets the experiment label explicitly. Without that flag, AgentV uses the reserved tags.experiment key (see below), then the suite name, then the eval filename. The precedence is --experiment > tags.experiment > default. There is no top-level experiment field — a run is labeled with tags.experiment. The Dashboard uses “Experiment” for the comparison and result grouping concept; folder names are only storage allocation and must not define result semantics.
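The label fallback order above can be written as a short function. This is a sketch with a hypothetical `experiment_label` helper. It assumes the filename fallback keeps the full file name, since the text does not say whether the extension is stripped.

```python
from pathlib import Path


def experiment_label(cli_experiment, tags, suite_name, eval_path):
    """Pick the label: --experiment, then tags.experiment, then suite name, then filename."""
    if cli_experiment:
        return cli_experiment
    # Only the map form of tags carries the reserved experiment key.
    if isinstance(tags, dict) and tags.get("experiment"):
        return tags["experiment"]
    if suite_name:
        return suite_name
    return Path(eval_path).name  # assumption: full filename, extension included


from_flag = experiment_label("nightly", {"experiment": "support-codex"}, "support-regression", "evals/support.eval.yaml")
from_tags = experiment_label(None, {"experiment": "support-codex"}, "support-regression", "evals/support.eval.yaml")
from_name = experiment_label(None, ["regression"], "support-regression", "evals/support.eval.yaml")
from_file = experiment_label(None, None, None, "evals/support.eval.yaml")
```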

Tags as run metadata (`tags.experiment`)

Suite-level tags accepts either the existing selection form (a string or list of strings that drives select.tags / --tag name filtering) or a promptfoo-shaped map:

tags:
  experiment: baseline-v2
  team: compliance

The map form is run metadata, not selection. The reserved experiment key feeds the experiment namespace, and the full map is emitted to summary.json.metadata.tags and every index.jsonl row so the Dashboard can group trend/compare views by tags.experiment.

Set or override map tags from the CLI with a repeatable --tag key=value flag (--tag experiment=baseline-v2 --tag team=compliance); bare --tag name keeps its existing file-selection meaning. Tags merge with precedence CLI --tag key=value > project config tags > eval tags. --experiment still wins over tags.experiment for the namespace, and an explicit --tag experiment= clears the label back to the default.
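The two meanings of --tag and the merge order can be illustrated as follows. The helpers are hypothetical and the parsing is a guess at the simplest behavior consistent with the text. In particular, modeling `--tag experiment=` as removing the key so that the default label applies is an assumption.

```python
def parse_tag_flags(values):
    """Split --tag values into map tags (key=value) and bare selection names."""
    map_tags, selectors = {}, []
    for value in values:
        if "=" in value:
            key, _, tag_value = value.partition("=")
            map_tags[key] = tag_value
        else:
            selectors.append(value)
    return map_tags, selectors


def merge_tags(eval_tags, config_tags, cli_tags):
    """Merge map tags with precedence CLI > project config > eval file."""
    merged = {**eval_tags, **config_tags, **cli_tags}
    # Assumption: an explicit empty experiment value clears the label.
    if merged.get("experiment") == "":
        del merged["experiment"]
    return merged


cli_map, cli_selectors = parse_tag_flags(["experiment=baseline-v2", "team=compliance", "regression"])
merged = merge_tags({"experiment": "draft", "team": "support"}, {"team": "platform"}, cli_map)
cleared = merge_tags({"experiment": "draft", "team": "support"}, {"team": "platform"}, {"experiment": ""})
```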

Imported source suite metadata appears in index.jsonl rows and manifests. Use index.jsonl fields such as eval_path, test_id, target, and result_dir for identity and artifact discovery instead of reconstructing paths from suite names or wrapper layout.
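A consumer that follows this advice reads rows and trusts their fields rather than rebuilding paths. The sketch below assumes that index.jsonl sits at the root of the run bundle, that result_dir is relative to the run directory, and that target is a plain string in each row. None of these details are confirmed here, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_index_rows(run_dir):
    """Read every non-empty line of index.jsonl as one JSON row."""
    rows = []
    with (Path(run_dir) / "index.jsonl").open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                rows.append(json.loads(line))
    return rows


def artifact_dirs(run_dir, rows):
    """Map (eval_path, test_id, target) to the artifact directory named by result_dir."""
    return {
        (row["eval_path"], row["test_id"], row["target"]): Path(run_dir) / row["result_dir"]
        for row in rows
    }
```

Usage would look like `artifact_dirs(run_dir, load_index_rows(run_dir))`, where `run_dir` points at a `.agentv/results/<run_id>/` bundle.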

For the complete result file contract, including why row metadata is semantic truth and directories are storage allocation, see Result Artifact Contract.