How to Test AI Applications: Determinism vs. Probability

Why the traditional QA “A + B = C” rule breaks down for AI application testing, and what to do about it.

Traditional QA does not change in essence when you arm it with AI tools. You still plan your coverage with classic test design methods like equivalence partitioning or pairwise testing. You still work with the mindset that if A + B is expected to be C, and A + B is not C, it’s a defect.

In this deterministic world, it rarely makes sense to run the exact same test twice. AI tools can help check coverage quality, write test cases, or generate automation to drastically increase productivity, but they don’t change the essence of testing.
If A + B should always be C, you check that equality – with or without AI instruments.

However, if you are an AI QA or ML Evaluation Engineer, your A + B could be C on the first run, C + k on the second, and C – k on the third.

You live in a world of dice rolls.

Essentially, you are measuring the probability of the result.
To calculate that probability, you must have a statistically representative number of observations.

Why is this the case?

Because LLM output is probabilistic by design and, in truth, even the developers of Large Language Models (LLMs) don’t know exactly how they work in every detail. Yes, they understand the architecture (the Transformer, for example) and the underlying principles, including the fact that each next token is typically sampled from a probability distribution. But they cannot guarantee that a given input will always produce the same output.
The “path” through the sampled tokens can vary, and so can the final result.

What can an AI QA do to handle this?

1. Brush up on statistics: Understand concepts like “representative samples” and confidence intervals.
2. Execute N runs: Run every test N times to see the distribution of results, not a single outcome.
3. Stress the prompts: Use variants to understand exactly where the deviations begin.
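
Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: `flaky_test` is a hypothetical stand-in for a real model call plus a pass/fail check, and the Wilson score interval is just one common way to put a confidence interval around a pass rate.

```python
import math
import random


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% confidence."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


def run_n_times(test, n: int) -> int:
    """Run a boolean test n times and return the number of passes."""
    return sum(1 for _ in range(n) if test())


# Hypothetical test: passes about 90% of the time, like a dice roll.
rng = random.Random(42)

def flaky_test() -> bool:
    return rng.random() < 0.9


n = 100
passes = run_n_times(flaky_test, n)
low, high = wilson_interval(passes, n)
print(f"pass rate {passes / n:.2f}, 95% CI [{low:.3f}, {high:.3f}]")
```

The width of the interval is the practical answer to "how many runs are enough": if it is too wide to support a decision, increase N.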

For example, if testing an AI’s ability to “draw a house”:

1. Test a baseline prompt: “Draw me a house.”
2. Test a highly complex rephrasing: “I formally request that you execute a bi-dimensional graphical representation of a domestic domicile…”
3. Iteratively add specific details (colors, materials, etc.).

Repeat each prompt N times to gather meaningful statistics.
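
A rough sketch of that loop, under stated assumptions: `fake_model` is a hypothetical placeholder that fails more often on longer prompts (a real project would call the model and an evaluator here), and the "detailed" prompt is only an illustrative example of adding colors and materials.

```python
import random
from collections import Counter


def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a model call: longer prompts fail more often."""
    failure_chance = min(0.9, len(prompt.split()) * 0.02)
    return "other" if rng.random() < failure_chance else "house"


def pass_rates(variants: dict[str, str], n: int, seed: int = 0) -> dict[str, float]:
    """Run every prompt variant n times and return the pass rate per variant."""
    rng = random.Random(seed)
    rates = {}
    for name, prompt in variants.items():
        outcomes = Counter(fake_model(prompt, rng) for _ in range(n))
        rates[name] = outcomes["house"] / n
    return rates


variants = {
    "baseline": "Draw me a house.",
    "formal": (
        "I formally request that you execute a bi-dimensional graphical "
        "representation of a domestic domicile…"
    ),
    # Illustrative detail variant; not taken from a real test suite.
    "detailed": "Draw me a red brick house with a green roof.",
}

rates = pass_rates(variants, n=200)
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

Comparing the per-variant rates side by side shows where the deviations begin, which a single run of each prompt cannot do.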

Finally, you must plan which metrics you will use to evaluate these results.
That is a topic for at least one more post – perhaps even a whole series.

In our project, we run the pipeline several times against the same or similar data and then calculate the average value of each metric. Averaging smooths out run-to-run variation, including the occasional hallucination, so a single unlucky run does not distort the result.
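
As a minimal sketch of that averaging step (the metric names and values below are invented for illustration and are not our project's real numbers), the per-metric mean can be reported together with its spread:

```python
from statistics import mean, stdev


def summarize_runs(runs: list[dict[str, float]]) -> dict[str, dict[str, float]]:
    """Return the mean and sample standard deviation of each metric across runs."""
    if not runs:
        raise ValueError("at least one run is required")
    summary = {}
    for metric in runs[0]:
        values = [run[metric] for run in runs]
        summary[metric] = {
            "mean": mean(values),
            # stdev needs at least two data points.
            "stdev": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Hypothetical results of three pipeline runs on the same data.
runs = [
    {"accuracy": 0.80, "faithfulness": 0.90},
    {"accuracy": 0.90, "faithfulness": 0.70},
    {"accuracy": 0.85, "faithfulness": 0.80},
]

summary = summarize_runs(runs)
for metric, stats in summary.items():
    print(f"{metric}: mean={stats['mean']:.3f}, stdev={stats['stdev']:.3f}")
```

Reporting the standard deviation next to the mean shows how much a metric moves between runs, which helps when deciding whether a change in the average is real or just noise.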
