Local CLI today. Eval governance layer tomorrow.

Inspect your LLM eval set before you trust the score.

GoldenSetDoctor (GSD) helps AI teams find hidden problems in their golden sets: duplicate test cases, leaked answers, weak rubrics, ambiguous prompts, missing metadata, and release-gating risks.

What is an eval set? A collection of example tasks used to test whether an AI system behaves correctly.
What can go wrong? The examples can be duplicated, stale, leaked, unclear, or impossible to grade consistently.
What does GSD do? It audits the eval set itself, then produces a report your team can act on.
Example report (report.html)

Eval Fitness Score: 68. Not ready for release gating.

This eval set has enough leakage and rubric risk that the score may create false confidence.

GSD007  Blocker  Expected answer appears in reference docs
GSD008  Blocker  Expected answer is too vague to grade
GSD006  Warn     Several cases test the same scenario
GSD012  Fixable  Case is missing source/provenance
Why it exists

An eval score can look precise while the test set underneath is broken.

AI teams increasingly use evals to decide whether a prompt, model, agent, or RAG pipeline is ready to ship. But if the golden set is duplicated, leaky, ambiguous, or unrepresentative, the score can become a very confident lie.

DUPLICATES: Same scenario, many times

Repeated cases inflate scores and make coverage look broader than it is.

LEAKAGE: The answer is in the context

Closed-book tests quietly become open-book tests, and results stop meaning what the team thinks they mean.

RUBRICS: No one can grade it consistently

Weak expected answers create noisy judgments, reviewer debates, and unstable release calls.

GOVERNANCE: No source, owner, or severity

When a release depends on an eval, teams need to know where each case came from and why it matters.

What GSD checks

GSD is a health check for the eval set, not a replacement for your eval runner.

Use your existing tools to score model outputs. Use GoldenSetDoctor to inspect whether the dataset behind those scores is trustworthy enough to influence a release decision.

Clusters near-duplicate cases so repeated scenarios are visible.
Looks for expected answers that can be derived from reference documents.
Flags missing, vague, placeholder, subjective, or multi-intent expected answers.
Shows caveats for what cannot be proven from the eval file alone.
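
To make the duplicate and leakage checks above concrete, here is a minimal, illustrative sketch in Python. It is not GSD's implementation; the case fields (id, input, expected), the docs/ reference folder, and the similarity threshold are assumptions chosen for the example.

# Illustrative sketch only; not GSD's actual implementation.
# Case fields, the docs/ folder, and the 0.85 threshold are assumptions.
import json
from difflib import SequenceMatcher
from itertools import combinations
from pathlib import Path

def similarity(a, b):
    # Cheap lexical similarity; a real check might use embeddings instead.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_near_duplicates(cases, threshold=0.85):
    # Pairs of cases whose inputs are nearly identical.
    return [
        (a["id"], b["id"])
        for a, b in combinations(cases, 2)
        if similarity(a["input"], b["input"]) >= threshold
    ]

def find_leakage(cases, refs_dir="docs"):
    # Cases whose expected answer appears verbatim in any reference doc.
    ref_text = " ".join(
        p.read_text(encoding="utf-8").lower()
        for p in Path(refs_dir).rglob("*.md")
    )
    return [c["id"] for c in cases if c["expected"].lower() in ref_text]

with open("evals.jsonl", encoding="utf-8") as f:
    cases = [json.loads(line) for line in f if line.strip()]

print("possible duplicates:", find_near_duplicates(cases))
print("possible leakage:", find_leakage(cases))

A real scan would need stronger signals than plain lexical matching; the sketch only shows the shape of the problem.
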
How it works

Run a scan, open a report, fix the highest-risk cases first.

The current prototype runs locally on JSONL datasets and optional reference docs. The intended v0.1 path is simple enough for a new user to try in minutes.

01
Point GSD at a JSONL eval set

Each line is an eval case with an input, expected answer, and optional metadata such as tags or context URL.
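
For illustration only (the field names below are assumptions, not a confirmed schema), a case could be appended like this:

# Hypothetical example case; field names are illustrative, not a confirmed schema.
import json

case = {
    "id": "billing-refund-001",
    "input": "What is the refund window for annual plans?",
    "expected": "30 days from the purchase date.",
    "tags": ["billing", "policy"],
    "context_url": "https://example.com/docs/refunds",
}

with open("evals.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(case) + "\n")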

02
Scan for health issues

GSD buckets intent, finds duplicate clusters, checks reference leakage, and lints rubric quality.

03
Generate a shareable report

The HTML report explains the most serious risks, the affected cases, and suggested fixes.

04
Use the result before release

The future gate command will fail CI when a golden set is not healthy enough for release gating.

# scan the eval dataset
python3 -m gsd.cli scan evals.jsonl \
  --refs docs/ \
  --out .gsd/run.json \
  --model gpt-4o-mini

# render a local report
python3 -m gsd.cli report \
  .gsd/run.json \
  --html .gsd/report.html

# future v0.1 shape
gsd gate .gsd/run.json --config gsd.yml
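
Until the gate command exists, a team could approximate it with a small script in CI. The run.json layout assumed below (an issues list with code, title, and severity fields) is hypothetical, shown only to illustrate failing a build on blockers.

# Hypothetical CI gate; the run.json layout assumed here is illustrative only.
import json
import sys

with open(".gsd/run.json", encoding="utf-8") as f:
    run = json.load(f)

blockers = [i for i in run.get("issues", []) if i.get("severity") == "blocker"]

for issue in blockers:
    print(f"BLOCKER {issue.get('code')}: {issue.get('title')}")

if blockers:
    sys.exit(1)  # a non-zero exit fails the CI job

print("No blockers found; the eval set passes this simple gate.")
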
End-state vision

The goal is eval-set governance without a heavyweight platform.

GoldenSetDoctor should become the layer teams run before trusting eval scores in release reviews, pull requests, and CI pipelines.

Local-first: Keep sensitive eval data close.

Run checks locally against test sets, reference docs, and eventually sampled production logs without requiring a hosted dashboard.

Release-aware: Turn findings into launch language.

Move beyond raw issue lists into fitness scores, readiness states, blockers, warnings, caveats, and fix plans.

Tool-friendly: Fit beside the eval stack you already use.

Work with JSONL today, then add adapters for CSV, Promptfoo, DeepEval, Braintrust, LangSmith, and GitHub Actions.

Important honesty

GSD will tell you what it cannot know.

That is part of the product. A scan can find many problems inside an eval set, but it cannot certify correctness, production coverage, or legal/policy completeness without more context.

Can it prove my expected answers are correct?

Not from the eval set alone. That requires source-of-truth docs, policies, changelogs, or domain review.

Can it tell whether my eval represents production?

Only when production examples or intent distributions are supplied. Comparing the eval set against that data is planned as a later, context-aware check.

Does it run model evals?

No. It inspects the dataset behind evals. It is meant to sit next to tools that score model outputs.

Why should this run in CI?

Because an eval set can regress like code. If a PR introduces leakage or adds weak cases to a gating set, the team should know before release.

Start with a local scan. Build toward release readiness.

GoldenSetDoctor is early, but the direction is clear: make eval-set quality visible, actionable, and hard to ignore before AI releases depend on it.

Open GoldenSetDoctor on GitHub