How to Test AI Applications: The Gold Standard for LLM-as-a-judge

If you use LLMs to evaluate other LLMs, give the judge a gold standard to compare against. Without one, your LLM-as-a-judge relies on its own memory and is more likely to hallucinate, especially in niche or professional domains.

The Scenario: A customer asks for a refund because their shoes arrived in the wrong color.

Bot Response: “I’m so sorry! We don’t usually refund for color choice, but here is a 10% discount!”

The Reality: Company policy requires a full refund for shipping errors. Your LLM-as-a-judge should catch such mistakes and penalize them.

The “Gold Standard” is the reference the judge compares the answer against.

❌ Bad: [No source provided, judge relies on its own memory]

✅ Good: “Reference Policy: ‘If an item is shipped in the wrong color/size, the customer is entitled to a 100% refund plus a return label.’”

The “Why”: Grounding reduces the risk that the judge hallucinates. LLMs “know” many things, but that internal knowledge may conflict with your specific company rules. Grounding ensures the judge grades against your facts, not its training data.
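As a rough illustration, a grounded judge prompt for the shoe scenario could be assembled like this. This is a minimal sketch: the function name `build_judge_prompt`, the prompt wording, and the PASS/FAIL output format are all assumptions made for the example, not a prescribed template.

```python
def build_judge_prompt(customer_message: str, bot_response: str, reference_policy: str) -> str:
    """Build a judge prompt that always carries the gold-standard reference.

    Hypothetical helper: the wording and PASS/FAIL format are illustrative only.
    """
    # Refuse to build an ungrounded prompt: without a reference the judge
    # would fall back on its own memory.
    if not reference_policy.strip():
        raise ValueError("A gold-standard reference policy is required.")
    return (
        "You are grading a support bot's reply.\n"
        "Judge ONLY against the reference policy below, not your own knowledge.\n\n"
        f"Reference policy:\n{reference_policy}\n\n"
        f"Customer message:\n{customer_message}\n\n"
        f"Bot response:\n{bot_response}\n\n"
        "Reply with PASS if the response follows the policy, otherwise FAIL, "
        "followed by a one-sentence reason."
    )


prompt = build_judge_prompt(
    customer_message="My shoes arrived in the wrong color. I want a refund.",
    bot_response="We don't usually refund for color choice, but here is a 10% discount!",
    reference_policy=(
        "If an item is shipped in the wrong color/size, the customer is "
        "entitled to a 100% refund plus a return label."
    ),
)
print(prompt)
```

Raising an error on an empty reference is a deliberate choice here: it makes the “Bad” case above impossible to run by accident.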

The “Swap and Shuffle”
LLM judges suffer from position bias: they tend to prefer the first option in a list simply because it appeared first.

The Fix: Always run each pairwise evaluation twice. First ask the judge to compare Model A vs Model B, then swap the order and compare B vs A. If the judge picks “the first one” both times, it is reacting to position rather than quality, and you know your prompt needs more work.
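The swap check can be sketched in a few lines. The `judge` callable is a stand-in for a real model call and is assumed to return the string "first" or "second"; that interface, the name `swap_and_compare`, and both stub judges are hypothetical and exist only to make the example runnable.

```python
from typing import Callable

# Assumed interface: judge(first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str], str]


def swap_and_compare(judge: Judge, answer_a: str, answer_b: str) -> str:
    """Run the judge in both orders; return "A", "B", or "inconsistent"."""
    verdicts = (judge(answer_a, answer_b), judge(answer_b, answer_a))
    for verdict in verdicts:
        if verdict not in ("first", "second"):
            raise ValueError(f"Unexpected verdict: {verdict!r}")
    # Map each positional verdict back to the model it refers to.
    winner_ab = "A" if verdicts[0] == "first" else "B"  # A was shown first
    winner_ba = "B" if verdicts[1] == "first" else "A"  # B was shown first
    return winner_ab if winner_ab == winner_ba else "inconsistent"


def always_first(first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def prefers_longer(first: str, second: str) -> str:
    """Stub judge that ignores position and picks the longer answer."""
    return "first" if len(first) >= len(second) else "second"


biased_result = swap_and_compare(always_first, "10% discount", "Full refund plus return label")
stable_result = swap_and_compare(prefers_longer, "10% discount", "Full refund plus return label")
print(biased_result, stable_result)
```

A judge that picks the same position in both runs names a different model each time, so the function reports "inconsistent" instead of a winner. Counting how often that happens across a test set gives a simple signal of how much the prompt still needs work.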

LLM-as-a-judge is a software engineering problem, not a creative writing one. Treat your prompts like code.

In our project, we pay special attention to the gold standards. Most of them are written by subject matter experts, but from time to time we use a powerful model to draft them and then give the result to the same subject matter expert for checking.
