Why your LLM-as-a-judge is “too nice” (and how to fix it)
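Many evaluation frameworks fail silently because the LLM-as-a-judge is engineered to be agreeable rather than accurate. It will happily overlook a critical policy violation as long as the bot's tone is warm and apologetic, so a response that should fail still gets a passing score and nothing in the pipeline flags the miss.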
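One way to make this failure visible is a small canary check: give the judge a warm reply that breaks policy and a curt reply that follows it, then see which one it prefers. The sketch below is illustrative only. The `Judge` signature, the sample policy, both replies, the 0.5 pass threshold, and the two toy judges are assumptions made up for this example, not part of any particular evaluation framework; in practice `judge` would wrap your own LLM call.

```python
from typing import Callable

# Hypothetical judge signature: (policy, bot_reply) -> score in [0, 1].
Judge = Callable[[str, str], float]

# Made-up policy and replies, used purely as a canary pair.
POLICY = "Never promise a refund before the order has been verified."
WARM_BUT_VIOLATING = (
    "I'm so sorry for the trouble! I completely understand, and I've "
    "guaranteed your full refund right now."
)
CURT_BUT_COMPLIANT = (
    "I can't confirm a refund until the order is verified. "
    "Send the order number."
)


def is_too_nice(judge: Judge, pass_threshold: float = 0.5) -> bool:
    """Return True if the judge passes the violating reply, or scores it
    at least as high as the compliant one."""
    bad = judge(POLICY, WARM_BUT_VIOLATING)
    good = judge(POLICY, CURT_BUT_COMPLIANT)
    return bad >= pass_threshold or bad >= good


def tone_biased_judge(policy: str, reply: str) -> float:
    """Toy stand-in for an LLM judge that only rewards apologetic wording."""
    warm_words = ("sorry", "understand", "apologize")
    hits = sum(word in reply.lower() for word in warm_words)
    return min(1.0, 0.3 + 0.3 * hits)


def strict_judge(policy: str, reply: str) -> float:
    """Toy stand-in for a judge that fails any reply promising a refund."""
    return 0.0 if "guaranteed" in reply.lower() else 1.0


if __name__ == "__main__":
    print("tone-biased judge too nice:", is_too_nice(tone_biased_judge))
    print("strict judge too nice:", is_too_nice(strict_judge))
```

Running a pair like this before trusting a judge's scores turns the silent failure into an explicit one: if the warm violation passes, the problem is in the judge, not in the bot under evaluation.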