Local CLI today. Eval governance layer tomorrow.

Inspect your LLM eval set before you trust the score.

GoldenSetDoctor (GSD) helps AI teams find hidden problems in their golden sets: duplicate test cases, leaked answers, weak rubrics, ambiguous prompts, missing metadata, and release-gating risks.

What is an eval set? A collection of example tasks used to test whether an AI system behaves correctly.
What can go wrong? The examples can be duplicated, stale, leaked, unclear, or impossible to grade consistently.
What does GSD do? It audits the eval set itself, then produces a report your team can act on.
Example report (report.html)

Eval Fitness Score: 68. Not ready for release gating.

This eval set has enough leakage and rubric risk that the score may create false confidence.

GSD007  Blocker  Expected answer appears in reference docs
GSD008  Blocker  Expected answer is too vague to grade
GSD006  Warn     Several cases test the same scenario
GSD012  Fixable  Case is missing source/provenance
Why it exists

An eval score can look precise while the test set underneath is broken.

AI teams increasingly use evals to decide whether a prompt, model, agent, or RAG pipeline is ready to ship. But if the golden set is duplicated, leaky, ambiguous, or unrepresentative, the score can become a very confident lie.

DUPLICATES: Same scenario, many times

Repeated cases inflate scores and make coverage look broader than it is.

LEAKAGE: The answer is in the context

Closed-book tests quietly become open-book tests, and results stop meaning what the team thinks they mean.

RUBRICS: No one can grade it consistently

Weak expected answers create noisy judgments, reviewer debates, and unstable release calls.

GOVERNANCE: No source, owner, or severity

When a release depends on an eval, teams need to know where each case came from and why it matters.

What GSD checks

GSD is a health check for the eval set, not a replacement for your eval runner.

Use your existing tools to score model outputs. Use GoldenSetDoctor to inspect whether the dataset behind those scores is trustworthy enough to influence a release decision.

Clusters near-duplicate cases so repeated scenarios are visible.
Looks for expected answers that can be derived from reference documents.
Flags missing, vague, placeholder, subjective, or multi-intent expected answers.
Shows caveats for what cannot be proven from the eval file alone.
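
To make the duplicate and leakage checks above concrete, here is a minimal, illustrative sketch in Python. It is not GSD's implementation; the case fields (id, input, expected), the docs/ reference folder, and the similarity threshold are assumptions chosen for the example.

# Illustrative sketch only; not GSD's actual implementation.
# Case fields, the docs/ folder, and the 0.85 threshold are assumptions.
import json
from difflib import SequenceMatcher
from itertools import combinations
from pathlib import Path

def similarity(a, b):
    # Cheap lexical similarity; a real check might use embeddings instead.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_near_duplicates(cases, threshold=0.85):
    # Pairs of cases whose inputs are nearly identical.
    return [
        (a["id"], b["id"])
        for a, b in combinations(cases, 2)
        if similarity(a["input"], b["input"]) >= threshold
    ]

def find_leakage(cases, refs_dir="docs"):
    # Cases whose expected answer appears verbatim in any reference doc.
    ref_text = " ".join(
        p.read_text(encoding="utf-8").lower()
        for p in Path(refs_dir).rglob("*.md")
    )
    return [c["id"] for c in cases if c["expected"].lower() in ref_text]

with open("evals.jsonl", encoding="utf-8") as f:
    cases = [json.loads(line) for line in f if line.strip()]

print("possible duplicates:", find_near_duplicates(cases))
print("possible leakage:", find_leakage(cases))

A real scan would need stronger signals than plain lexical matching; the sketch only shows the shape of the problem.
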
How it works

Run a scan, open a report, fix the highest-risk cases first.

The current prototype runs locally on JSONL datasets and optional reference docs. The intended v0.1 path is simple enough for a new user to try in minutes.

01
Point GSD at a JSONL eval set

Each line is an eval case with an input, expected answer, and optional metadata such as tags or context URL.
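
For illustration only (the field names below are assumptions, not a confirmed schema), a case could be appended like this:

# Hypothetical example case; field names are illustrative, not a confirmed schema.
import json

case = {
    "id": "billing-refund-001",
    "input": "What is the refund window for annual plans?",
    "expected": "30 days from the purchase date.",
    "tags": ["billing", "policy"],
    "context_url": "https://example.com/docs/refunds",
}

with open("evals.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(case) + "\n")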

02
Scan for health issues

GSD buckets intent, finds duplicate clusters, checks reference leakage, and lints rubric quality.

03
Generate a shareable report

The HTML report explains the most serious risks, the affected cases, and suggested fixes.

04
Use the result before release

The future gate command will fail CI when a golden set is not healthy enough for release gating.

# scan the eval dataset
python3 -m gsd.cli scan evals.jsonl \
  --refs docs/ \
  --out .gsd/run.json \
  --model gpt-4o-mini

# render a local report
python3 -m gsd.cli report \
  .gsd/run.json \
  --html .gsd/report.html

# future v0.1 shape
gsd gate .gsd/run.json --config gsd.yml
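
Until the gate command exists, a team could approximate it with a small script in CI. The run.json layout assumed below (an issues list with code, title, and severity fields) is hypothetical, shown only to illustrate failing a build on blockers.

# Hypothetical CI gate; the run.json layout assumed here is illustrative only.
import json
import sys

with open(".gsd/run.json", encoding="utf-8") as f:
    run = json.load(f)

blockers = [i for i in run.get("issues", []) if i.get("severity") == "blocker"]

for issue in blockers:
    print(f"BLOCKER {issue.get('code')}: {issue.get('title')}")

if blockers:
    sys.exit(1)  # a non-zero exit fails the CI job

print("No blockers found; the eval set passes this simple gate.")
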
End-state vision

The goal is eval-set governance without a heavyweight platform.

GoldenSetDoctor should become the layer teams run before trusting eval scores in release reviews, pull requests, and CI pipelines.

Local-first: Keep sensitive eval data close.

Run checks locally against test sets, reference docs, and eventually sampled production logs without requiring a hosted dashboard.

Release-aware: Turn findings into launch language.

Move beyond raw issue lists into fitness scores, readiness states, blockers, warnings, caveats, and fix plans.

Tool-friendly: Fit beside the eval stack you already use.

Work with JSONL today, then add adapters for CSV, Promptfoo, DeepEval, Braintrust, LangSmith, and GitHub Actions.

Important honesty

GSD will tell you what it cannot know.

That is part of the product. A scan can find many problems inside an eval set, but it cannot certify correctness, production coverage, or legal/policy completeness without more context.

Can it prove my expected answers are correct?

Not from the eval set alone. That requires source-of-truth docs, policies, changelogs, or domain review.

Can it tell whether my eval represents production?

Only when production examples or intent distributions are supplied. Comparing the eval set against that data is planned as a later, context-aware check.

Does it run model evals?

No. It inspects the dataset behind evals. It is meant to sit next to tools that score model outputs.

Why should this run in CI?

Because an eval set can regress like code. If a PR introduces leakage or adds weak cases to a gating set, the team should know before release.

Start with a local scan. Build toward release readiness.

GoldenSetDoctor is early, but the direction is clear: make eval-set quality visible, actionable, and hard to ignore before AI releases depend on it.

Open GoldenSetDoctor on GitHub