How to Test AI Applications: The Grader’s Ruler for LLM-as-a-judge

If you are using LLMs to evaluate other LLMs, avoid vague adjectives in your evals; use rubrics instead. Let’s walk through an example.

Take the scenario from the previous post: a customer asks for a refund because their shoes arrived in the wrong color.

Bot Response: “I’m so sorry! We don’t usually refund for color choice, but here is a 10% discount!”

The Reality: Company policy requires a full refund for shipping errors. Your LLM-as-a-judge should catch such mistakes and penalize them.

Here is how to build an LLM-as-a-Judge rubric for evaluation:

Say we decided to use a scale from 1 to 5, where 1 is the least appropriate answer as judged by the LLM and 5 is the most accurate and polite. You should define exactly what a “1” versus a “5” looks like.

| Score | ❌ Bad (Vague) | ✅ Good (Metric-Based) |
|-------|----------------|------------------------|
| 1 | The answer is bad. | The response denies a valid refund or ignores the shipping error. |
| 3 | The answer is okay. | The response acknowledges the error but offers a discount instead of a refund. |
| 5 | The answer is perfect. | The response identifies the shipping error and initiates a 100% refund per policy. |

Without a rubric, the judge uses its own internal “training data” as the benchmark. If the judge was trained on polite but unhelpful text, it will give high scores to “polite but unhelpful” bot responses.
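The rubric above can be embedded directly in the judge prompt so the model grades against your policy rather than its own priors. Here is a minimal sketch; the prompt wording and function names are illustrative, and the actual LLM call is left out so you can plug in any provider:

```python
# Sketch: embedding a metric-based rubric in an LLM-as-a-judge prompt.
# RUBRIC mirrors the scoring table above; build_judge_prompt is a
# hypothetical helper, not a library API.

RUBRIC = """Score the bot response on a 1-5 scale:
1: The response denies a valid refund or ignores the shipping error.
3: The response acknowledges the error but offers a discount instead of a refund.
5: The response identifies the shipping error and initiates a 100% refund per policy."""

def build_judge_prompt(complaint: str, policy: str, bot_response: str) -> str:
    """Assemble the full evaluation prompt sent to the judge LLM."""
    return (
        "You are grading a customer-support bot.\n\n"
        f"Customer complaint: {complaint}\n"
        f"Policy: {policy}\n"
        f"Bot response: {bot_response}\n\n"
        f"{RUBRIC}\n\n"
        "Reply with a single integer from 1 to 5."
    )

prompt = build_judge_prompt(
    complaint="My shoes arrived in the wrong color.",
    policy="Shipping errors require a full refund.",
    bot_response="I'm so sorry! We don't usually refund for color choice, "
                 "but here is a 10% discount!",
)
print(prompt)
```

Because the anchors (1, 3, 5) are concrete behaviors rather than adjectives, two different judge models grading the same response have far less room to disagree.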

Use Chain-of-Thought (CoT) to force the model to process the data before it renders a verdict.

❌ Bad: “Read the text and provide a score.”

✅ Good: “1. Identify the customer’s core complaint. 2. Locate the relevant section in the Policy Document. 3. Compare the Bot’s offer to the Policy requirement. 4. Assign a score based ONLY on this comparison.”

The “Why”: This combats precedence bias (or “jumping to conclusions”). If you ask for a score first, the model picks a number based on “vibes” and then hallucinates a reason to justify it. Forcing it to show its work first leads to higher accuracy.
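One cheap way to enforce the reasoning-first order is on the parsing side: require the judge to emit its step-by-step analysis before a final `SCORE:` line, and reject any output that jumps straight to a number. A sketch, assuming this output convention (the prompt text and `parse_verdict` helper are illustrative):

```python
# Sketch: a CoT grading prompt plus a parser that rejects verdicts
# with no reasoning in front of them. The SCORE: convention is an
# assumption of this example, not a standard format.
import re

COT_JUDGE_PROMPT = """\
1. Identify the customer's core complaint.
2. Locate the relevant section in the Policy Document.
3. Compare the Bot's offer to the Policy requirement.
4. Assign a score based ONLY on this comparison.

Write out your reasoning for steps 1-3 first, then end with one line:
SCORE: <1-5>"""

def parse_verdict(judge_output: str) -> int:
    """Extract the score, refusing outputs that skipped the reasoning."""
    match = re.search(r"SCORE:\s*([1-5])\s*$", judge_output)
    if not match:
        raise ValueError("No SCORE line found")
    reasoning = judge_output[: match.start()].strip()
    if not reasoning:
        raise ValueError("Judge jumped straight to a score without reasoning")
    return int(match.group(1))

sample = (
    "The complaint is a shipping error. Policy requires a full refund. "
    "The bot offered only a 10% discount.\n"
    "SCORE: 3"
)
print(parse_verdict(sample))  # → 3
```

Rejecting reasoning-free outputs (and retrying) keeps the “score first, rationalize later” failure mode out of your eval results instead of silently averaging it in.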
