GoldenSetDoctor helps AI teams find hidden problems in their golden sets: duplicate test cases, leaked answers, weak rubrics, ambiguous prompts, missing metadata, and release-gating risks.
This eval set has enough leakage and rubric risk that the score may create false confidence.
GSD007 (Blocker): Expected answer appears in reference docs
GSD008 (Blocker): Expected answer is too vague to grade
GSD006 (Warn): Several cases test the same scenario
GSD012 (Fixable): Case is missing source/provenance
AI teams increasingly use evals to decide whether a prompt, model, agent, or RAG pipeline is ready to ship. But if the golden set is full of duplicates, leakage, ambiguity, or unrepresentative cases, the score can become a very confident lie.
DUPLICATES
Same scenario, many times
Repeated cases inflate scores and make coverage look broader than it is.
LEAKAGE
The answer is in the context
Closed-book tests quietly become open-book tests, and results stop meaning what the team thinks they mean.
RUBRICS
No one can grade it consistently
Weak expected answers create noisy judgments, reviewer debates, and unstable release calls.
GOVERNANCE
No source, owner, or severity
When a release depends on an eval, teams need to know where each case came from and why it matters.
Use your existing tools to score model outputs. Use GoldenSetDoctor to inspect whether the dataset behind those scores is trustworthy enough to influence a release decision.
The current prototype runs locally on JSONL datasets and optional reference docs. The intended v0.1 path is simple enough for a new user to try in minutes.
Each line is an eval case with an input, expected answer, and optional metadata such as tags or context URL.
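For a concrete picture, a minimal script like the one below could produce a couple of cases in that shape. The field names (input, expected, tags, context_url) are assumptions for illustration, not a confirmed GSD schema.

import json

# Illustrative cases only; field names are assumed, not a documented GSD format.
cases = [
    {
        "input": "How do I rotate an API key?",
        "expected": "Keys are rotated from Settings > API; old keys stay valid for 24 hours.",
        "tags": ["account", "security"],
        "context_url": "docs/api-keys.md",
    },
    {
        "input": "Which plans include SSO?",
        "expected": "SSO is available on the Enterprise plan only.",
        "tags": ["billing"],
    },
]

with open("evals.jsonl", "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")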
GSD buckets cases by intent, finds duplicate clusters, checks for reference leakage, and lints rubric quality.
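GSD's actual checks are not shown here, but a naive sketch of the duplicate-clustering and leakage ideas, reusing the assumed field names above, could look like this:

import json, re
from collections import defaultdict
from pathlib import Path

def norm(text):
    # Crude normalization for grouping near-identical prompts.
    return re.sub(r"\W+", " ", text.lower()).strip()

cases = [json.loads(line) for line in open("evals.jsonl")]
# Assumes markdown reference docs live under docs/.
refs = " ".join(p.read_text().lower() for p in Path("docs").rglob("*.md"))

# Duplicate clusters: the same normalized input appearing more than once.
clusters = defaultdict(list)
for i, case in enumerate(cases):
    clusters[norm(case["input"])].append(i)
dupes = {k: v for k, v in clusters.items() if len(v) > 1}

# Naive leakage check: the expected answer appears verbatim in the reference docs.
leaks = [i for i, case in enumerate(cases) if case["expected"].lower() in refs]

print(f"{len(dupes)} duplicate clusters, {len(leaks)} possibly leaked answers")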
The HTML report explains the most serious risks, the affected cases, and suggested fixes.
The future gate command will fail CI when a golden set is not healthy enough for release gating.
# scan the eval dataset
python3 -m gsd.cli scan evals.jsonl \
  --refs docs/ \
  --out .gsd/run.json \
  --model gpt-4o-mini

# render a local report
python3 -m gsd.cli report \
  .gsd/run.json \
  --html .gsd/report.html

# future v0.1 shape
gsd gate .gsd/run.json --config gsd.yml
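The run.json schema is not documented above, so the sketch below invents a minimal shape (a findings list with code and severity fields) purely to show how a gate step could fail CI on blockers; it is not the real gate command.

import json, sys

# Hypothetical run.json shape:
# {"findings": [{"code": "GSD007", "severity": "Blocker", "title": "..."}, ...]}
# The real schema may differ; this only illustrates the gating idea.
run = json.load(open(".gsd/run.json"))
blockers = [f for f in run.get("findings", []) if f.get("severity") == "Blocker"]

if blockers:
    for f in blockers:
        print(f"BLOCKER {f.get('code')}: {f.get('title', '')}")
    sys.exit(1)  # a non-zero exit fails the CI job
print("golden set passes the gate")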
GoldenSetDoctor should become the layer teams run before trusting eval scores in release reviews, pull requests, and CI pipelines.
Run checks locally against test sets, reference docs, and eventually sampled production logs without requiring a hosted dashboard.
Move beyond raw issue lists into fitness scores, readiness states, blockers, warnings, caveats, and fix plans.
Work with JSONL today, then add adapters for CSV, Promptfoo, DeepEval, Braintrust, LangSmith, and GitHub Actions.
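As one example of what a CSV adapter might look like, the sketch below maps a hypothetical CSV with input, expected, and tags columns onto the JSONL shape used earlier; the column names are assumptions.

import csv, json

# Hypothetical CSV columns: input, expected, tags (semicolon-separated).
with open("evals.csv", newline="") as src, open("evals.jsonl", "w") as dst:
    for row in csv.DictReader(src):
        case = {"input": row["input"], "expected": row["expected"]}
        if row.get("tags"):
            case["tags"] = row["tags"].split(";")
        dst.write(json.dumps(case) + "\n")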
That is part of the product. A scan can find many problems inside an eval set, but it cannot certify correctness, production coverage, or legal/policy completeness without more context.
Not from the eval set alone. That requires source-of-truth docs, policies, changelogs, or domain review.
Only when production examples or intent distributions are supplied. That is a later context-aware check.
No. It inspects the dataset behind evals. It is meant to sit next to tools that score model outputs.
Because an eval set can regress like code. If a PR adds leakage or weak gating cases, the team should know before release.
GoldenSetDoctor is early, but the direction is clear: make eval-set quality visible, actionable, and hard to ignore before AI releases depend on it.