A Working Day of an AI QA Engineer

If you think ML Evaluation is just “testing of AI applications,” come spend a day in my shoes.

09:30 – 10:30 The Architectural Shift
Started the day with a sync on our agentic AI workflow. The development team is introducing a new agent.

The Challenge: I need to ensure that introducing the new agent doesn’t degrade system quality, which means comparing the old version of the system against the new one.

11:00 – 12:00 The Metrics Debate
Met with the ML team to define how we grade this new agent. We’re moving beyond simple accuracy.

Our Focus: We settled on Faithfulness (no hallucinations) and Efficiency (did it take 10 steps when 2 were enough?).
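To make Efficiency concrete, here is a minimal sketch of one way to score it, assuming a hypothetical run record with steps_taken and reference_steps fields; the field names and the min-ratio formula are my illustration, not our team’s actual metric:

```python
# Hypothetical run record: field names are illustrative, not the project's schema.
from dataclasses import dataclass

@dataclass
class AgentRun:
    steps_taken: int      # steps the agent actually executed
    reference_steps: int  # steps a curated "ideal" trajectory needs

def efficiency(run: AgentRun) -> float:
    """Score 1.0 when the agent matches the reference trajectory length,
    dropping toward 0 as it takes detours (10 steps when 2 were enough)."""
    return min(1.0, run.reference_steps / max(run.steps_taken, 1))
```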

12:00 – 14:00 Python & Implementation
Time to get hands-on. I’m implementing these metrics with Python libraries or an LLM-as-a-judge approach; we’ll see which works better. Here I work directly with the project code, not with AQA code, and it is far more complex than what I dealt with as a classical QA engineer. AQA code is usually built on a dedicated framework such as Selenium and is typically easier to read and write, so at first this was a big challenge for me.
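For the LLM-as-a-judge option, a minimal Faithfulness check could look like the sketch below; the judge prompt, the PASS/FAIL protocol, and the model name are my assumptions for illustration, using the OpenAI SDK’s chat completions call:

```python
# Minimal LLM-as-a-judge sketch for Faithfulness: ask a judge model whether
# every claim in the answer is supported by the retrieved context.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are a strict grader.

Context:
{context}

Answer:
{answer}

Does the answer make only claims supported by the context?
Reply with a single word: PASS or FAIL."""

def judge_faithfulness(context: str, answer: str) -> bool:
    """Return True when the judge model finds no unsupported claims."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        temperature=0,        # keep the judge as deterministic as possible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(context=context, answer=answer),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```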

14:00 – Lunch 🙂

15:00 – 16:00 The Feedback Loop
Took a final glance at the code, ran unit tests to make sure that I didn’t break anything, and pushed the code for review.

(Let’s pretend they reviewed it right after I pushed :)) My colleagues spotted a flaw in how I handled edge cases for non-English queries.

16:30 – 17:30 The Fix
Refined the logic, addressed the comments, and got that satisfying “LGTM.” Merge to main.
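The real flaw stays behind the NDA, but as a hypothetical example of this class of fix: if the judge prompt silently assumed English input, detecting the query language up front and branching on it would be one reasonable repair:

```python
# Hypothetical fix: detect the query language before grading, so non-English
# queries are not force-graded with an English-only judge prompt.
from langdetect import detect  # pip install langdetect
from langdetect.lang_detect_exception import LangDetectException

def query_language(query: str) -> str:
    """Return an ISO 639-1 language code, defaulting to 'en' for short or
    ambiguous strings that the detector cannot classify."""
    try:
        return detect(query)
    except LangDetectException:
        return "en"
```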

17:30 – 18:30 The Evaluation Pipeline Run
(The idea is to compare the old version of the system with the new one on data that is already prepared.)
Running the new eval suite against both the old and new versions across several datasets. Because the results are non-deterministic, each run is repeated several times. While doing a first pass over the results, I noticed something strange: the new version “eats” fewer tokens, yet takes longer. Trying to understand why.

18:30 – 19:00 The Reports
Wrapped up by presenting the Evaluation Report to the team and discussing the results in the chat.

Any resemblance to real projects is coincidental; I respect the NDA 🙂
