Result Artifact Contract

AgentV writes each eval invocation as a portable run bundle. The bundle is the source of truth for Dashboard, reports, compare/trend tooling, CI gates, and external adapters.

The contract is run-centric:

  • summary.json owns aggregate run facts.
  • index.jsonl owns row-level discovery and filtering.
  • Per-case sidecars own detailed payloads such as grading, metrics, transcripts, timing, generated files, and raw provider evidence.
  • Dashboard, search, SQLite, HTML reports, and vendor exports are rebuildable projections over the bundle.

The default local layout is:

.agentv/results/
  <run_id>/
    summary.json
    index.jsonl
    tags.json                     # optional mutable Dashboard tags
    <case-or-allocation>/
      summary.json                # optional per-case aggregate, especially repeats
      test/                       # optional generated test bundle
        EVAL.yaml
        targets.yaml
        files/
        graders/
      attempt-1/
        result.json
        grading.json
        metrics.json
        timing.json
        transcript.json
        transcript-raw.jsonl
        outputs/
          answer.md
          file_changes.diff
      attempt-2/
        result.json
        grading.json
        metrics.json
        timing.json
        transcript.json
        transcript-raw.jsonl
        outputs/
          answer.md
          file_changes.diff
  .indexes/                       # reserved rebuildable/local indexes
  .cache/                         # reserved local cache

<run_id> is the only identity that the run-bundle path commits to. It helps AgentV put completed runs somewhere predictable, but readers must not infer semantic truth from folder names. Use fields in summary.json and index.jsonl for experiment, target, variant, attempt, eval path, case identity, timing, scores, and artifact paths.

The run bundle does not add target, model, variant, or cases/ folders below <run_id>. Per-result directories are allocated from row identity, usually with a readable test-id or slug prefix plus a short hash suffix, and remain opaque to consumers.

experiment is metadata: it is how users label a condition such as baseline, candidate, with_skills, or without_skills. It is recorded in summary.json and rows, not as a parent directory and not as a runtime-policy object. If a bundle is copied, combined, published, or imported under a different directory, its metadata still carries the facts consumers should query.

Top-level dot-prefixed directories such as .indexes/ and .cache/ are reserved for rebuildable local state and are skipped by run discovery.
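
A minimal sketch of that discovery rule, assuming a reader that lists candidate run directories itself. The names isRunCandidate and listRunDirs are hypothetical and are not part of AgentV's API; the only rule taken from the contract is that dot-prefixed entries are skipped.

```typescript
import { readdirSync } from "node:fs";

// Hypothetical helper, not an AgentV API.
// Dot-prefixed names such as .indexes/ and .cache/ are reserved local state,
// so they are never treated as run bundles.
export function isRunCandidate(name: string): boolean {
  return !name.startsWith(".");
}

// Lists candidate run directory names under a results root
// (for example .agentv/results). The names are only used to locate bundles;
// run facts still come from each bundle's summary.json and index.jsonl.
export function listRunDirs(resultsRoot: string): string[] {
  return readdirSync(resultsRoot, { withFileTypes: true })
    .filter((entry) => entry.isDirectory() && isRunCandidate(entry.name))
    .map((entry) => entry.name)
    .sort();
}
```

Keeping the name check separate from the directory listing makes the rule easy to reuse when the entries come from somewhere other than the local filesystem.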

| File or field | Owns | Use it for |
| --- | --- | --- |
| summary.json | Aggregate run metadata and rollups: run id, experiment metadata, counts, pass rate, score summaries, duration, token/cost totals, and writer metadata. | Listing runs, CI summaries, quick dashboards, trend cards, and validating that a run is complete enough to inspect. |
| index.jsonl | Canonical row index: one row per result, attempt, or case-level aggregate, with identity fields, filter metadata, scores, status, and explicit run-relative paths to sidecars. | Filtering, compare/trend inputs, Dashboard detail routing, rerun/resume lookup, export adapters, and artifact discovery. |
| result.json | Compact per-attempt manifest for one attempt directory, including AgentV execution_status and verdict. | Loading one attempt without scanning the whole run index. |
| grading.json | Grader outputs, assertions, rubric evidence, execution-metric grader facts, and scoring provenance. | Explaining why a row passed or failed. |
| metrics.json | Derived executor behavior summary, such as tool calls, files touched, shell commands, errors, turns, and output sizes. | Dashboard behavior views, metric-style graders, adapter projections, and lightweight analysis. |
| outputs/file_changes.diff | Full unified diff of workspace file changes when file changes are captured. | Human review and external artifact inspection; LLM and code graders still receive the same full diff through file_changes. |
| timing.json | Duration, token usage, cost usage, and source labels such as provider_reported, token_estimated, aggregate, or unavailable. | Cost/latency reporting and provider-accounting audits. |
| transcript.json | AgentV-normalized transcript/timeline document with canonical tool_name values and transcript_summary. | Portable human review, transcript-aware graders, and tool-trajectory analysis. |
| transcript-raw.jsonl | Native provider or harness evidence when available. | Parser debugging, forensic review, and preserving source bytes without making provider schemas public AgentV fields. |
| test/ | Generated test bundle for the exact eval slice and target settings that produced a row. | Audit, external review, and rerun workflows that should not depend on a mutable source checkout. |
| artifact_pointers | Offload indirection for large detached payload bytes. | Finding payloads published outside the primary metadata/control-plane branch, such as transcript bytes on agentv/artifacts/v1. |

summary.json and index.jsonl are complementary, not redundant. A run list should not scan every row just to show pass rate or total duration, and a row reader should not parse aggregate summary structures to find one case’s grading or transcript. Keep aggregate questions on summary.json; keep row and artifact discovery on index.jsonl.

Each index.jsonl line is a JSON object. The exact field set grows as AgentV adds providers and projections, but stable rows follow these rules:

  • Field names are snake_case.
  • Identity and filter fields live on the row, not only in directory names.
  • Sidecar references are explicit path fields, relative to the run directory.
  • Large detached payloads may also have artifact_pointers, but ordinary sidecars should still be discoverable through path fields.
  • Unknown fields should be preserved by adapters when they rewrite or project rows.
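
As a rough illustration of the last rule, the sketch below rewrites a row while carrying every field it does not recognize through untouched. The helper names rewriteRow and isSnakeCase are hypothetical, and the snake_case pattern is an assumption about what counts as a valid field name, not a published AgentV rule.

```typescript
type Row = Record<string, unknown>;

// Assumed pattern for snake_case names: lowercase words joined by single underscores.
const SNAKE_CASE = /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/;

// Hypothetical check an adapter could run before adding its own fields.
export function isSnakeCase(fieldName: string): boolean {
  return SNAKE_CASE.test(fieldName);
}

// Hypothetical rewrite step: apply a patch of known fields and keep every
// other field, including ones this adapter has never seen.
export function rewriteRow(row: Row, patch: Row): Row {
  for (const key of Object.keys(patch)) {
    if (!isSnakeCase(key)) {
      throw new Error(`field name is not snake_case: ${key}`);
    }
  }
  // Spread order matters: the original row first, then the patch on top.
  return { ...row, ...patch };
}
```

Building the output from the whole input row, rather than from a fixed list of known fields, is what keeps newer fields alive when an older adapter rewrites them.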

Example row:

{
  "timestamp": "2026-06-30T08:15:00.000Z",
  "run_id": "2026-06-30T08-15-00-000Z",
  "experiment": "with_skills",
  "tags": { "experiment": "with_skills", "team": "support" },
  "eval_path": "evals/support/refunds.eval.yaml",
  "test_id": "refund-eligibility",
  "target": "codex-gpt5",
  "variant": "skills-v2",
  "attempt": 1,
  "execution_status": "ok",
  "score": 0.92,
  "duration_ms": 184200,
  "result_dir": "refund-eligibility--4f9a7c2d1b6e",
  "summary_path": "refund-eligibility--4f9a7c2d1b6e/summary.json",
  "grading_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/grading.json",
  "metrics_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/metrics.json",
  "timing_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/timing.json",
  "transcript_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/transcript.json",
  "transcript_raw_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/transcript-raw.jsonl",
  "transcript_summary": {
    "total_turns": 4,
    "tool_calls": { "file_read": 2, "shell": 1, "unknown": 0 },
    "files_read": ["src/refunds.ts"],
    "files_modified": ["src/refunds.ts"],
    "shell_commands": ["bun test refunds.test.ts"],
    "web_fetches": [],
    "errors": [],
    "thinking_blocks": 1
  },
  "output_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/outputs/answer.md",
  "answer_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/outputs/answer.md",
  "file_changes_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/outputs/file_changes.diff",
  "test_dir": "refund-eligibility--4f9a7c2d1b6e/test"
}

Rows can represent repeated attempts, multi-target runs, imported suites, manual prepare/grade attempts, or imported provider sessions. That is why experiment, eval_path, test_id, target, variant, attempt, and source metadata belong in index.jsonl: tools can filter dynamically without requiring every run to be pre-split into semantic folders.

When a run resolves a promptfoo-shaped tags map (from suite tags, project config tags, or --tag key=value), the resolved map is emitted as tags on each row and as summary.json.metadata.tags. Its reserved experiment key matches the row experiment field, so trend/compare views can group by tags.experiment.
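
To make the grouping concrete, here is a small sketch that averages row scores per tags.experiment. The function names are hypothetical, the fallback to the row-level experiment field and the "(none)" bucket are assumptions for rows without tags, and rows without a numeric score are simply skipped.

```typescript
type Row = Record<string, unknown>;

// Reads tags.experiment, falling back to the row-level experiment field.
// The "(none)" bucket is an assumption of this sketch.
export function experimentOf(row: Row): string {
  const tags = row.tags;
  if (tags && typeof tags === "object" && !Array.isArray(tags)) {
    const value = (tags as Record<string, unknown>).experiment;
    if (typeof value === "string") return value;
  }
  return typeof row.experiment === "string" ? row.experiment : "(none)";
}

// Groups rows by experiment and returns the mean score of each group.
export function meanScoreByExperiment(rows: Iterable<Row>): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const row of rows) {
    if (typeof row.score !== "number") continue; // skip rows without a score
    const key = experimentOf(row);
    const acc = sums.get(key) ?? { total: 0, count: 0 };
    acc.total += row.score;
    acc.count += 1;
    sums.set(key, acc);
  }
  const means = new Map<string, number>();
  sums.forEach((acc, key) => means.set(key, acc.total / acc.count));
  return means;
}
```

Because the grouping key comes from row metadata, the same function works on rows merged from several run directories.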

Use repeat for authoring configuration and attempts for produced executions. The attempt-1/, attempt-2/, and later folders under a result directory are artifact folders for those produced executions. Do not treat those folder names as the comparison dimension. Repeated stochastic samples should be represented by explicit metadata such as sample_index and sample_count; infrastructure retries should use retry metadata such as retry_index, retry_count, and retry_reason when available.

Consumers should read a bundle in this order:

  1. Resolve the run directory from either a directory path or an index.jsonl path.
  2. Load summary.json for aggregate metadata and run-level display.
  3. Stream index.jsonl for row identity, filters, status, scores, and sidecar paths.
  4. Resolve sidecar paths relative to the run directory.
  5. Rebuild any local cache, search index, SQLite table, static report, or vendor projection from summary.json, index.jsonl, and sidecars.

Do not reconstruct paths from suite, name, test_id, target, or directory names. result_dir is readable when possible, but it is still an opaque run-local allocation that may be suffixed or otherwise changed to avoid collisions.
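
Steps 1 and 4 of the read order can be sketched as follows. Both function names are hypothetical, and the check that rejects paths escaping the run directory is a defensive assumption of this sketch rather than something the contract states.

```typescript
import path from "node:path";

// Step 1: accept either a run directory or a path to its index.jsonl.
export function resolveRunDir(input: string): string {
  return path.basename(input) === "index.jsonl" ? path.dirname(input) : input;
}

// Step 4: resolve a run-relative sidecar path taken from a row field such as
// grading_path. The path is used as given, never rebuilt from test_id or target.
export function resolveSidecar(runDir: string, relPath: string): string {
  const root = path.resolve(runDir);
  const full = path.resolve(root, relPath);
  // Assumed safety check: refuse values that point outside the run directory.
  if (full !== root && !full.startsWith(root + path.sep)) {
    throw new Error(`sidecar path escapes run directory: ${relPath}`);
  }
  return full;
}
```

For example, resolveSidecar(resolveRunDir(input), row.grading_path) gives the file to open for one row's grading, whichever form of input the caller supplied.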

Do not treat derived artifacts as canonical:

  • Dashboard indexes are caches over the run bundle.
  • Search indexes are caches over rows and sidecars.
  • SQLite databases are query accelerators.
  • HTML reports are renderings.
  • Vendor-neutral projection bundles are adapter handoffs.
  • Phoenix, Langfuse, Opik, or other backend views are external projections or correlations, not AgentV’s source of truth.

Run an eval and inspect the portable bundle:

agentv eval evals/support/refunds.eval.yaml --experiment with_skills
ls .agentv/results/<run_id>
cat .agentv/results/<run_id>/summary.json
cat .agentv/results/<run_id>/index.jsonl

Find failed rows without loading every sidecar:

jq -r 'select(.execution_status != "ok" or .score < 0.5) |
  [.eval_path, .test_id, .target, .grading_path] | @tsv' \
  .agentv/results/<run_id>/index.jsonl

Compare two completed runs by their row indexes:

agentv compare \
  .agentv/results/<baseline-run-id>/index.jsonl \
  .agentv/results/<candidate-run-id>/index.jsonl

Generate a shareable report from the same canonical bundle:

agentv results report .agentv/results/<run_id>

An adapter that exports run results should treat index.jsonl as the row catalog:

import { createReadStream } from "node:fs";
import path from "node:path";
import { createInterface } from "node:readline";

// Streams index.jsonl one line at a time so large runs are never loaded whole.
export async function* rows(runDir: string) {
  const rl = createInterface({
    input: createReadStream(path.join(runDir, "index.jsonl"), "utf8"),
    crlfDelay: Infinity,
  });
  for await (const line of rl) {
    if (!line.trim()) continue; // skip blank lines
    yield JSON.parse(line) as Record<string, unknown>;
  }
}

// Example run directory; substitute a real <run_id>.
const runDir = ".agentv/results/2026-run";
for await (const row of rows(runDir)) {
  // Sidecar paths are run-relative, so join them onto the run directory.
  const gradingPath = row.grading_path;
  if (typeof gradingPath === "string") {
    console.log(path.join(runDir, gradingPath));
  }
}

Adapter guidance:

  • Preserve unknown row fields when possible.
  • Prefer path fields such as grading_path, metrics_path, timing_path, transcript_path, and transcript_raw_path over ad hoc path construction.
  • Use artifact_pointers only for detached payload lookup; do not make pointers the discovery path for ordinary sidecars that are present in the run tree.
  • If you build a database or search index, store enough source metadata to rebuild it from the run bundle and invalidate it when summary.json or index.jsonl changes.
  • Keep backend-specific anonymization, upload, and schema mapping in the adapter layer. AgentV’s canonical bundle remains backend-neutral.
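
One way to approach the invalidation point above is to store a fingerprint of the two canonical files alongside the derived index and rebuild when it changes. This is a sketch under assumptions: the names bundleFingerprint and isStale are hypothetical, and using file size plus modification time as the fingerprint is a choice made here, not an AgentV convention. A content hash would be stricter at the cost of reading both files.

```typescript
import { statSync } from "node:fs";
import path from "node:path";

type FileStat = { size: number; mtimeMs: number };
type StatFn = (file: string) => FileStat;

// Default stat implementation; injectable so callers can swap in another source.
const defaultStat: StatFn = (file) => statSync(file);

// Hypothetical fingerprint built from size and mtime of summary.json and index.jsonl.
export function bundleFingerprint(runDir: string, stat: StatFn = defaultStat): string {
  return ["summary.json", "index.jsonl"]
    .map((name) => {
      const s = stat(path.join(runDir, name));
      return `${name}:${s.size}:${s.mtimeMs}`;
    })
    .join("|");
}

// A derived index is stale when no fingerprint was stored or the bundle has changed.
export function isStale(
  storedFingerprint: string | undefined,
  runDir: string,
  stat: StatFn = defaultStat,
): boolean {
  return storedFingerprint !== bundleFingerprint(runDir, stat);
}
```

An adapter would save the fingerprint next to its database or search index and call isStale before serving queries, rebuilding from the run bundle when it returns true.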