How a Short Word Can Turn Your AI Product into a Legal Nightmare

Let’s look at the “NOT” problem.

In ML evaluation, especially when using the LLM-as-a-Judge approach, we often fall into the trap of the “halo effect.” If the response from the AI model being tested sounds authoritative and professional, the LLM-as-a-Judge tends to award it a high score without checking whether the content is actually correct.

Lazy Judge Trap

Imagine you’re creating a tool for summarizing complex legal contracts for non-lawyers. You’re setting up an LLM Judge with a standard prompt:

“Rate the document summary on a scale of 1 to 5 for accuracy and fluency.”

Problem

Phrase in the document: “The supplier is NOT liable for damages exceeding $1 million.”

Document summary generated by the AI model being tested: “The supplier is liable for all damages exceeding $1 million.”

LLM Judge Rating: 4.5/5.

LLM Judge Explanation: “The summary uses professional legal terminology, is well-structured, and clearly describes the limits of liability.”

⚖️
To an LLM-as-a-Judge, the source text and the summary look virtually identical. All the keywords are there: supplier, liability, damages, $1 million. The syntax is perfect.

But for a lawyer, this missing “not” is a 180-degree reversal of meaning.
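To make the “virtually identical” point concrete, here is a minimal sketch that measures plain word overlap between the two sentences. Word overlap (Jaccard similarity) is only a stand-in for surface similarity and is an assumption of this example; an LLM judge does not literally compute it. The helper name `tokens` is hypothetical.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of word-like tokens."""
    return set(re.findall(r"[a-z0-9$]+", text.lower()))

source = "The supplier is NOT liable for damages exceeding $1 million."
summary = "The supplier is liable for all damages exceeding $1 million."

src, summ = tokens(source), tokens(summary)

# Jaccard similarity: shared tokens divided by all distinct tokens.
overlap = len(src & summ) / len(src | summ)

# Tokens that appear in only one of the two sentences.
changed = src ^ summ

print(f"overlap: {overlap:.2f}")  # overlap: 0.82
print(sorted(changed))            # ['all', 'not']
```

Nine of the eleven distinct tokens are shared, and the two that differ are exactly the ones that flip the meaning.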

Solution

Forced Text Deconstruction
You need to wean the LLM Judge off relying on style. Force it to check logical operators with zero tolerance.

How to structure a prompt for a “critical” Judge:

Step 1. Extraction: Find all negations (not, never, none) and quantifiers or comparators (all, only, greater than) in both the source text and the summary generated by the AI model being tested.

Step 2. Verification: For each detected term, find the matching clause in the other text. A term with no counterpart is a mismatch.

Step 3. Inversion Penalty: If “not” is omitted or added where it shouldn’t be, the score MUST be 1, no matter how “professional” the tone.
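The three steps can be sketched as a prompt template plus a small guard that applies the penalty in code instead of trusting the judge to do it. This is only an illustration under stated assumptions: the prompt wording, the JSON reply format, and the names `JUDGE_PROMPT`, `build_judge_prompt`, and `enforce_inversion_penalty` are hypothetical, and the call to the judge model itself is left out. The template extracts terms from both the source and the summary so that an omitted “not” can be detected.

```python
import json

# Hypothetical prompt wording; adapt it to your own judge model.
JUDGE_PROMPT = """You are a strict reviewer of contract summaries.

Step 1. Extraction: list every negation (not, never, none) and every
quantifier (all, only, greater than) in the SOURCE and in the SUMMARY.
Step 2. Verification: for each term, quote the matching clause in the
other text, or write "NO MATCH".
Step 3. Inversion penalty: if a negation is omitted or added where it
should not be, the score MUST be 1, no matter how professional the tone.

Reply with JSON only:
{{"inversions": ["<term>: <what changed>"], "score": <integer 1-5>}}

SOURCE:
{source}

SUMMARY:
{summary}
"""

def build_judge_prompt(source: str, summary: str) -> str:
    """Fill the template with the texts to compare."""
    return JUDGE_PROMPT.format(source=source, summary=summary)

def enforce_inversion_penalty(judge_reply: str) -> int:
    """Parse the judge's JSON reply and apply Step 3 in code."""
    reply = json.loads(judge_reply)
    if reply["inversions"]:
        # Any reported inversion forces the minimum score.
        return 1
    return int(reply["score"])

# Example with a made-up judge reply that reports the inversion
# but still hands out a generous score.
fake_reply = '{"inversions": ["not: omitted in summary"], "score": 4}'
print(enforce_inversion_penalty(fake_reply))  # 1
```

Applying the penalty outside the model means a judge that notices the inversion but still gives a high score cannot override the rule.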

On our project, we pay special attention to prompts. If the value of a metric looks suspicious (for example, very close to the maximum), the prompt is a likely cause. In this case, we re-analyze the prompt and add examples and extra detail.
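A check like this is easy to automate. The sketch below flags a batch of judge scores whose average sits close to the maximum; the function name `looks_suspicious` and the `margin` threshold are assumptions chosen for illustration, not values from our project.

```python
from statistics import mean

def looks_suspicious(scores, max_score=5.0, margin=0.25):
    """Return True when the average score is within `margin` of the maximum."""
    return mean(scores) >= max_score - margin

print(looks_suspicious([5, 4.5, 5, 5]))  # True
print(looks_suspicious([3, 4, 5, 2]))    # False
```

A flagged batch is a prompt to review the judge prompt by hand, not proof that the scores are wrong.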
