
How to Test AI Applications: The Grader’s Ruler for LLM-as-a-judge
If you use LLMs to evaluate other LLMs, avoid vague adjectives such as "good" or "helpful" in your eval criteria; use rubrics with explicit scoring levels instead. Let's consider an example:
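As a minimal sketch of what a rubric could look like in code, the snippet below turns each criterion into a set of observable, scored levels and renders them into a judge prompt. The `Criterion` class, the `RUBRIC` contents, and the function names are all hypothetical, chosen only for illustration; no particular LLM API is assumed, and sending the prompt to a judge model is left out.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric row: a name plus an observable description per score."""
    name: str
    levels: dict  # maps an integer score to what earns that score


# Hypothetical rubric for a customer-support answer (illustrative only).
RUBRIC = [
    Criterion("addresses_question", {
        0: "Does not answer the user's question.",
        1: "Answers only part of the question.",
        2: "Answers every part of the question.",
    }),
    Criterion("cites_policy", {
        0: "Mentions no policy.",
        1: "Names the relevant policy.",
    }),
]


def build_judge_prompt(question, answer, rubric):
    """Render the rubric into a prompt that asks for one score per criterion."""
    lines = ["Score the answer on each criterion using only the levels listed.", ""]
    for criterion in rubric:
        lines.append(f"Criterion: {criterion.name}")
        for score in sorted(criterion.levels):
            lines.append(f"  {score}: {criterion.levels[score]}")
    lines += ["", f"Question: {question}", f"Answer: {answer}"]
    return "\n".join(lines)


def max_score(rubric):
    """Highest total a response can earn under the rubric."""
    return sum(max(criterion.levels) for criterion in rubric)


prompt = build_judge_prompt("Can I return shoes after 40 days?", "Yes, always.", RUBRIC)
```

The point of the design is that every score maps to something the judge can check in the text, rather than to an adjective it has to interpret.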

If you use LLMs to evaluate other LLMs, give the judge a gold-standard reference answer to compare against. Without one, your LLM-as-a-judge has to rely on its own memory and is more likely to hallucinate, especially when the texts come from a specialized professional domain.
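One way to enforce this, sketched here under stated assumptions, is to make the reference answer a required input when building the judge prompt. The function name `build_reference_prompt` and the prompt wording are hypothetical; the call to the judge model itself is omitted.

```python
def build_reference_prompt(question, candidate, reference):
    """Build a judge prompt that grounds the comparison in a gold-standard answer."""
    # Refuse to build a prompt with no reference, so the judge is never
    # asked to grade from its own memory.
    if not reference.strip():
        raise ValueError("a gold-standard reference answer is required")
    return "\n".join([
        "Compare the candidate answer with the reference answer.",
        "Judge only against the reference; do not rely on your own knowledge.",
        "",
        f"Question: {question}",
        f"Reference answer: {reference}",
        f"Candidate answer: {candidate}",
    ])


example_prompt = build_reference_prompt(
    "What is the notice period in the contract?",
    "Two weeks.",
    "30 calendar days.",
)
```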

If you're thinking about moving from QA to ML engineering, you should learn the main concepts behind large language models (LLMs) and how their output is evaluated. One of the key concepts is using a larger model to evaluate a smaller model's output, either instead of or alongside human evaluation.