How to Test AI Applications: The Gold Standard for LLM-as-a-judge

If you are using LLMs to evaluate other LLMs, give the judge a gold standard to compare against. Without one, your LLM-as-a-judge relies on its own memory and is far more likely to hallucinate, especially in niche or professional domains.

The Scenario: A customer asks for a refund because their shoes arrived in the wrong color.

Bot Response: “I’m so sorry! We don’t usually refund for color choice, but here is a 10% discount!”

The Reality: Company policy requires a full refund for shipping errors. Your LLM-as-a-judge should catch such mistakes and penalize them.

The “Gold Standard” is the reference text the judge compares the answer against.

❌ Bad: [No source provided, judge relies on its own memory]

✅ Good: “Reference Policy: ‘If an item is shipped in the wrong color/size, the customer is entitled to a 100% refund plus a return label.'”

The “Why”: This reduces hallucination bias. LLMs often “know” many things, but their internal knowledge might conflict with your specific company rules. Grounding ensures the judge is grading against your facts, not its training data.
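One way to make grounding hard to skip is to build the judge prompt in code and refuse to build it without a reference. The sketch below is only an illustration: `build_judge_prompt`, its parameters, and the PASS/FAIL output format are assumptions made for this example, not part of any library, and the prompt wording is a starting point rather than a tested template.

```python
from typing import Optional


def build_judge_prompt(question: str, bot_answer: str,
                       reference: Optional[str] = None) -> str:
    """Assemble a grounded judge prompt (hypothetical helper).

    Raises ValueError when no gold standard is supplied, so an
    ungrounded evaluation cannot be run by accident.
    """
    if not reference:
        raise ValueError(
            "No gold standard provided; the judge would rely on its own memory."
        )
    return (
        "You are grading a customer-support answer.\n"
        "Grade ONLY against the reference policy below, not your own knowledge.\n\n"
        f"Reference Policy: {reference}\n\n"
        f"Customer message: {question}\n\n"
        f"Bot response: {bot_answer}\n\n"
        "Reply with PASS if the response complies with the policy, "
        "otherwise FAIL, followed by a one-sentence reason."
    )


# The scenario from this post: the policy text is the gold standard.
policy = (
    "If an item is shipped in the wrong color/size, the customer is "
    "entitled to a 100% refund plus a return label."
)
prompt = build_judge_prompt(
    question="My shoes arrived in the wrong color. I want a refund.",
    bot_answer=(
        "I'm so sorry! We don't usually refund for color choice, "
        "but here is a 10% discount!"
    ),
    reference=policy,
)
print(prompt)
```

The resulting string can be sent to whichever model you use as a judge. The point of the design is that the reference travels with every single evaluation instead of living in someone's head.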

The “Swap and Shuffle”
An LLM-as-a-judge suffers from position bias: it tends to prefer the first option in a list simply because it appeared first.

The Fix: Always run your evaluation twice. First ask the judge to compare Model A vs Model B, then swap the order and ask it to compare B vs A. If the judge picks “the first one” both times, it is reacting to position rather than content, and you know your prompt needs more work.
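The double run can be sketched as a small wrapper around the judge. Everything here is an assumption for illustration: the judge is modeled as a plain function that receives two answers and returns "first" or "second", and the two stub judges stand in for real model calls so the example runs on its own.

```python
from typing import Callable

# Assumed interface: judge(first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str], str]


def compare_with_swap(judge: Judge, answer_a: str, answer_b: str) -> str:
    """Judge A vs B, then B vs A; return "A", "B", or "inconsistent"."""
    verdict_ab = judge(answer_a, answer_b)  # A is shown first
    verdict_ba = judge(answer_b, answer_a)  # B is shown first
    winner_ab = "A" if verdict_ab == "first" else "B"
    winner_ba = "B" if verdict_ba == "first" else "A"
    # Only trust the verdict if it survives the swap.
    return winner_ab if winner_ab == winner_ba else "inconsistent"


def position_biased_judge(first: str, second: str) -> str:
    """Stub that always prefers whatever it sees first."""
    return "first"


def refund_aware_judge(first: str, second: str) -> str:
    """Stub that looks at content: prefers the answer offering a full refund."""
    return "first" if "100% refund" in first else "second"


model_a = "We don't usually refund for color choice, but here is a 10% discount!"
model_b = "You are entitled to a 100% refund plus a return label."

print(compare_with_swap(position_biased_judge, model_a, model_b))  # inconsistent
print(compare_with_swap(refund_aware_judge, model_a, model_b))     # B
```

An "inconsistent" result is a signal about the judge, not about the models being compared, so it is worth counting those cases separately instead of folding them into a win rate.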

LLM-as-a-Judge is a software engineering problem, not a creative writing one. Treat your prompts like code.

In our project, we pay special attention to the gold standards. Most of them are written by subject matter experts, but from time to time we use a powerful model to draft them and then hand the draft to the same subject matter expert for review.
