How a Short Word Can Turn Your AI Product into a Legal Nightmare
Let’s look at the “NOT” problem.
In ML evaluation, especially with the LLM-as-a-Judge approach, it is easy to fall into the “halo effect” trap. If the AI model’s response sounds authoritative and professional, the judge tends to award it a high score even when the content is wrong.
Lazy Judge Trap
Imagine you’re creating a tool for summarizing complex legal contracts for non-lawyers. You’re setting up an LLM Judge with a standard prompt:
“Rate the document summary on a scale of 1 to 5 for accuracy and fluency.”
Problem
Phrase in the document: “The supplier is NOT liable for damages exceeding $1 million.”
Document summary generated by the AI model being tested: “The supplier is liable for all damages exceeding $1 million.”
LLM Judge Rating: 4.5/5.
LLM Judge Explanation: “The summary uses professional legal terminology, is well-structured, and clearly describes the limits of liability.”
⚖️
To an LLM-as-a-Judge, the source text and the summary look virtually identical. All the keywords are there: supplier, liability, damages, $1 million. The syntax is perfect.
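To see how close the two sentences are on the surface, here is a minimal sketch that compares their word sets. The helper name `word_set` and the use of Jaccard overlap are illustrative assumptions, not part of any judge implementation.

```python
import re

def word_set(text: str) -> set[str]:
    """Lowercased word tokens; "$" is kept so "$1" stays a single token."""
    return set(re.findall(r"[a-z0-9$]+", text.lower()))

source = "The supplier is NOT liable for damages exceeding $1 million."
summary = "The supplier is liable for all damages exceeding $1 million."

src, summ = word_set(source), word_set(summary)

# Jaccard overlap: shared words divided by all distinct words.
jaccard = len(src & summ) / len(src | summ)

only_in_source = src - summ    # words the summary dropped
only_in_summary = summ - src   # words the summary added

print(f"overlap: {jaccard:.2f}")
print("dropped:", only_in_source, "added:", only_in_summary)
```

Nine of the eleven distinct words are shared. The two that differ, “not” and “all”, are exactly the ones that carry the legal meaning.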
But for a lawyer, this missing “not” is a 180-degree reversal of meaning.
Solution
Forced Text Deconstruction
You need to wean the LLM Judge off relying on style. Force it to check logical operators with zero tolerance.
How to structure a prompt for a “critical” Judge:
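Step 1. Extraction: Find all negations (not, never, none), quantifiers (all, only), and comparators (greater than) in both the source text and the summary generated by the AI model being tested. An omitted “not” can only be detected if the source is scanned too.
Step 2. Verification: Match each detected term to its counterpart in the other text. A term that appears on one side but not on the other is a mismatch.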
Step 3. Inversion Penalty: If “not” is omitted or added where it shouldn’t be, the score MUST be 1, no matter how “professional” the tone.
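The three steps can be sketched as a prompt template plus a crude deterministic guardrail that overrides the judge’s score. Everything named here (`JUDGE_PROMPT`, `NEGATIONS`, `apply_inversion_penalty`) is hypothetical, and the sketch assumes you already have the source sentence aligned with its summary sentence. A word-list check will not catch paraphrased negations such as “is exempt from”, so treat it as a sanity check alongside the judge, not a replacement.

```python
import re
from collections import Counter

# Hypothetical word list; extend it for your domain.
NEGATIONS = {"not", "no", "never", "none", "neither", "nor", "without"}

JUDGE_PROMPT = """You are a strict reviewer of contract summaries.
Step 1. Extraction: list every negation (not, never, none) and quantifier
(all, only, greater than) in the SOURCE and in the SUMMARY.
Step 2. Verification: match each term found in the SUMMARY to the SOURCE.
Step 3. Inversion penalty: if a negation is omitted or added, the score
MUST be 1, no matter how professional the tone.
Return the score as a number from 1 to 5.

SOURCE:
{source}

SUMMARY:
{summary}
"""

def build_judge_prompt(source: str, summary: str) -> str:
    """Fill the template with the texts to compare."""
    return JUDGE_PROMPT.format(source=source, summary=summary)

def negations(text: str) -> Counter:
    """Count negation words, including contractions such as "isn't"."""
    normalized = text.lower().replace("’", "'")  # unify curly apostrophes
    tokens = re.findall(r"[a-z']+", normalized)
    return Counter(t for t in tokens if t in NEGATIONS or t.endswith("n't"))

def apply_inversion_penalty(judge_score: float,
                            source_sentence: str,
                            summary_sentence: str) -> float:
    """Force the score to 1 when the negation counts of the two sides differ."""
    if negations(source_sentence) != negations(summary_sentence):
        return 1.0
    return judge_score

source = "The supplier is NOT liable for damages exceeding $1 million."
summary = "The supplier is liable for all damages exceeding $1 million."
print(apply_inversion_penalty(4.5, source, summary))
```

Running the guardrail after the judge means a flattering 4.5 cannot survive a dropped “not”, even if the judge ignores the instruction in the prompt.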
On our project, we pay special attention to prompts. If the value of a metric looks suspicious (for example, very close to the maximum), the judge prompt is the likely cause. In that case, we re-analyze the prompt and add examples and extra details.
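A simple way to catch a suspicious metric automatically is to flag runs whose average judge score sits near the ceiling. This is only a sketch: the function name `looks_saturated` and the default `margin` of 0.25 are assumptions for illustration, not values from our project.

```python
from statistics import mean

def looks_saturated(scores: list[float],
                    max_score: float = 5.0,
                    margin: float = 0.25) -> bool:
    """Return True when the mean score is within `margin` of the maximum."""
    if not scores:
        raise ValueError("scores must not be empty")
    return mean(scores) >= max_score - margin

# A run where nearly everything gets top marks deserves a prompt review.
print(looks_saturated([5, 5, 4.5, 5, 5]))
print(looks_saturated([3, 4, 5, 2, 4]))
```

A flag here does not prove the prompt is broken. It only marks the run for the manual prompt review described above.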