Why your LLM-as-a-Judge is “too nice” (and how to fix it)

Many evaluation frameworks fail because their “judge” is too lenient. It overlooks critical policy violations as long as the response sounds friendly and helpful.

Consider a simple scenario that shows how a naive judge fails and how to fix it. We covered this scenario in a previous post on the importance of prompt engineering.

📋 The Scenario

  • Customer Complaint: “I need a refund. My shoes arrived in red, but I ordered black.”
  • Bot Response: “I’m so sorry about that! We don’t usually issue refunds for color choices, but here is a 10% discount coupon for your next order!”
  • The Reality: Company policy strictly requires a full refund for shipping errors. The bot just violated a core business rule, but it did so very politely.

Without the right guardrails, your LLM-as-a-judge will give this bot a 5/5.
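
One way to keep this scenario around is as a known-bad fixture: a case the judge is required to fail. The sketch below is only an illustration under assumptions. The names `JudgeCase`, `max_acceptable_score`, and `check_judge` are hypothetical, the 1 to 5 scale is assumed from the “5/5” above, and the two judges are stand-ins for whatever function calls your real LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class JudgeCase:
    """A known-bad example the judge is expected to score low (hypothetical structure)."""
    complaint: str
    bot_response: str
    policy: str
    max_acceptable_score: int  # assumption: scores run 1-5, so 2 means "must fail"


SHOES_CASE = JudgeCase(
    complaint="I need a refund. My shoes arrived in red, but I ordered black.",
    bot_response=(
        "I'm so sorry about that! We don't usually issue refunds for color "
        "choices, but here is a 10% discount coupon for your next order!"
    ),
    policy="Shipping errors require a full refund.",
    max_acceptable_score=2,
)


def check_judge(judge: Callable[[JudgeCase], int], case: JudgeCase) -> bool:
    """Return True if the judge was strict enough on this known-bad case."""
    return judge(case) <= case.max_acceptable_score


# Stand-in judges for illustration only; a real judge would call an LLM.
def naive_judge(case: JudgeCase) -> int:
    return 5  # rewards the polite tone and misses the violation


def strict_judge(case: JudgeCase) -> int:
    return 1  # penalizes the missing refund


print(check_judge(naive_judge, SHOES_CASE))   # False: the naive judge let it pass
print(check_judge(strict_judge, SHOES_CASE))  # True: the strict judge caught it
```

Running a handful of such fixtures after every prompt change gives an early signal that the judge has drifted back toward rewarding tone over policy.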

Here is how you structure a bulletproof evaluation prompt to catch these hidden failures:

1. The Persona (The “Who”)

Bad: “You are an assistant helping me grade a chatbot.”

Good: “You are a Senior Corporate Auditor specialized in Compliance. Your tone is objective, clinical, and strictly evidence-based.”

🧠 The Bias: Central Tendency. Without a strict, authoritative persona, LLMs tend to fall back on “safe” middle-ground scores (like 3 out of 5) to avoid being “wrong.” A sharp persona pushes the model to take a definitive stand based on rules, not vibes.

2. The Task (The “What”)

Bad: “Check if this response is helpful and polite.”

Good: “Evaluate if the response adheres to the ‘Shipping Error Policy.’ A successful response must prioritize the customer’s right to a full refund over any marketing offers or discounts.”

🧠 The Bias: Acquiescence Bias (Sycophancy). LLMs love to agree. The judge will often rate a response as “excellent” simply because it is empathetic and courteous, completely missing that it violated a strict business requirement with legal or financial consequences.
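
One possible way to stop courtesy from lifting a failing verdict is to have the judge report policy compliance and tone as separate fields, then let compliance cap the final score. This is a sketch under assumptions, not something the prompt above prescribes: `Verdict`, its field names, and the failing score of 1 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Judge output with compliance and tone kept apart (hypothetical fields)."""
    policy_compliant: bool
    tone_score: int  # assumption: 1-5 scale


FAIL_SCORE = 1  # assumption: a policy violation always gets the lowest score


def final_score(verdict: Verdict) -> int:
    """Tone only counts once the response has passed the policy check."""
    if not verdict.policy_compliant:
        return FAIL_SCORE
    return verdict.tone_score


# The polite coupon offer from the scenario: great tone, wrong resolution.
print(final_score(Verdict(policy_compliant=False, tone_score=5)))
```

The design choice here is that politeness can differentiate between compliant responses but can never rescue a non-compliant one.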

3. Execution Steps (The “How”)

Bad: “Read the text and provide a final score from 1 to 5.”

Good: “Follow these steps sequentially:

  1. Identify the core customer complaint.
  2. Locate the specific policy section that applies.
  3. Compare the Bot’s resolution against that policy section.
  4. Extract direct quotes supporting your evaluation.
  5. Assign a final score ONLY after completing steps 1–4.”

🧠 The Bias: Precedence Bias. If you ask for a score first, the model will anchor on a number drawn from its first impression and then construct a logical-sounding justification for it. Forcing a Chain-of-Thought (CoT) workflow compels the judge to show its work before rendering a verdict.
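
The persona, task, and steps above can be assembled into one prompt, and the reply can be checked so that the score really does come last. The following is a minimal sketch, assuming the judge is asked to answer in JSON. The key names in `REQUIRED_KEYS`, the function names, and the 1 to 5 range are assumptions made for illustration, and no particular LLM client is implied.

```python
import json

PERSONA = (
    "You are a Senior Corporate Auditor specialized in Compliance. "
    "Your tone is objective, clinical, and strictly evidence-based."
)
TASK = (
    "Evaluate if the response adheres to the 'Shipping Error Policy.' "
    "A successful response must prioritize the customer's right to a full "
    "refund over any marketing offers or discounts."
)
STEPS = [
    "Identify the core customer complaint.",
    "Locate the specific policy section that applies.",
    "Compare the Bot's resolution against that policy section.",
    "Extract direct quotes supporting your evaluation.",
    "Assign a final score ONLY after completing steps 1-4.",
]
# Assumed output contract: one key per step, with the score last.
REQUIRED_KEYS = ["complaint", "policy_section", "comparison", "quotes", "score"]


def build_judge_prompt(policy: str, complaint: str, bot_response: str) -> str:
    """Combine persona, task, ordered steps, and the case into one prompt."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(STEPS, start=1))
    return (
        f"{PERSONA}\n\n{TASK}\n\n"
        f"Follow these steps sequentially:\n{numbered}\n\n"
        "Reply with a JSON object whose keys appear in this order: "
        f"{', '.join(REQUIRED_KEYS)}.\n\n"
        f"Policy:\n{policy}\n\n"
        f"Customer complaint:\n{complaint}\n\n"
        f"Bot response:\n{bot_response}"
    )


def parse_verdict(raw: str) -> dict:
    """Reject replies that skip a step or put the score before the evidence."""
    data = json.loads(raw)
    if not isinstance(data, dict) or list(data.keys()) != REQUIRED_KEYS:
        raise ValueError(f"expected a JSON object with keys in order {REQUIRED_KEYS}")
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError("score must be an integer from 1 to 5")
    if not data["quotes"]:
        raise ValueError("at least one supporting quote is required")
    return data
```

Checking key order relies on `json.loads` preserving the order in which keys appear in the reply, which Python dictionaries do. A reply that leads with the score is rejected rather than repaired, so an anchored verdict never reaches the metrics.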

In our project, we pay special attention to this “too nice” LLM-as-a-judge behavior. If our evaluation metrics suddenly look suspiciously perfect – like consistently hitting a 1.0 (maximum score) when the underlying data suggests it shouldn’t be possible – it immediately raises a red flag.

When this happens, we don’t celebrate. Instead, we dig into both the code and the prompt architecture. In 90% of cases, a flawless 1.0 means our prompt instructions were either too vague or missing crucial boundary constraints, allowing the judge to take the easy way out and pass everything. True evaluation rigor means building a judge that is tough to please.
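
A check like this can be automated. The sketch below flags a batch of scores when nearly all of them sit at the ceiling. It is an illustration rather than our production code: the function name `looks_too_perfect` and the thresholds (`min_samples`, `max_share`) are hypothetical and would need tuning against your own data.

```python
from typing import Sequence


def looks_too_perfect(
    scores: Sequence[float],
    max_score: float = 1.0,
    min_samples: int = 20,    # assumption: smaller batches are too noisy to flag
    max_share: float = 0.95,  # assumption: 95%+ at the ceiling is suspicious
) -> bool:
    """Return True when almost every score in the batch is the maximum."""
    if len(scores) < min_samples:
        return False
    at_ceiling = sum(1 for s in scores if s >= max_score)
    return at_ceiling / len(scores) >= max_share


batch = [1.0] * 48 + [0.6, 0.8]
if looks_too_perfect(batch):
    print("Red flag: review the judge prompt before trusting these scores.")
```

A flag from this function is a prompt to investigate, not proof of a broken judge, since a genuinely easy dataset can also produce near-perfect scores.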

