A take-home assignment for an AI QA role

ML evaluation take-home task: what does it look like, and what do employers expect?
The Scenario:
A medical consultation app is seeing a spike in user complaints, yet the internal sentiment model still reports high “Global Accuracy.” Your mission: Find the blind spots the metrics are hiding.
The Data
1,000 user reviews (JSON format) containing ground truth, model predictions, and confidence scores.
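The task description doesn't fix an exact schema, so here is a minimal sketch of loading and sanity-checking such a file. The field names (`label`, `prediction`, `confidence`) are assumptions, not given by the task; swap in the actual keys from the provided JSON.

```python
import json

# Hypothetical record layout -- the field names are assumptions,
# not part of the task; replace them with the real keys.
sample = json.loads("""
[
  {"review_id": 1, "text": "The doctor listened carefully.",
   "label": "positive", "prediction": "positive", "confidence": 0.97},
  {"review_id": 2, "text": "Billing support never answered.",
   "label": "negative", "prediction": "positive", "confidence": 0.91}
]
""")

for record in sample:
    # Validate every record before computing a single metric
    assert record["label"] in {"positive", "negative", "neutral"}
    assert 0.0 <= record["confidence"] <= 1.0
```

Validating labels and confidence ranges up front catches malformed records before they silently skew any downstream metric.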
What is expected as a result
Demonstrating coding ability alone is not enough. Evaluation is about the “So what?”
A Structured Audit: A text explanation of where the blind spots are, backed by numbers.
Visual Evidence: Calibration Curves and Confusion Matrices that show why the old metrics missed the gaps.
To ace this, you need a hybrid profile:
- Theoretical Base: Knowing how models fail and which metrics apply to specific edge cases.
- Data Intuition: The ability to hunt for gaps manually and automatically.
- Engineering Rigor: Python skills for pipeline creation and implementing LLM-as-a-judge.
- Communication: The ability to share findings in a structured, accurate, and grammatically correct manner.
Now let’s break this hypothetical task down into phases.
Phase 1: Data Analysis
Before writing a single line of code, you must audit the data distribution:
- Check for Class Imbalance: Does “Positive” feedback outweigh “Negative” 10-to-1? If so, your Accuracy metric is lying to you.
- Audit for Bias: Is the model underperforming on specific demographic slices (e.g., medical jargon vs. plain English)?
Note: In medical AI, imbalance isn’t just about labels; it’s about representation.
- Critique the Status Quo: Why did the previous “Global Accuracy” fail? Compare it against metrics that actually matter for imbalanced data (e.g., per-class recall, F1, balanced accuracy).
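The imbalance check above takes only a few lines. The 10-to-1 split below is synthetic, chosen purely to mirror the scenario in the bullet; it shows why a majority-class baseline makes raw accuracy misleading:

```python
from collections import Counter

# Synthetic labels illustrating an assumed 10-to-1 skew
labels = ["positive"] * 900 + ["negative"] * 100

counts = Counter(labels)
majority = counts.most_common(1)[0][0]

# A model that always predicts the majority class scores 90% accuracy
# while catching zero negative reviews -- accuracy alone hides the gap.
baseline_accuracy = counts[majority] / len(labels)
negative_recall = 0.0  # this baseline never predicts "negative"
```

If your candidate model barely beats this baseline, its “high accuracy” is mostly the class distribution talking.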
Phase 2: The “Architect” Phase (Implementation)
Now, build the evaluation framework:
- Architecture: Write clean, modular code. Whether you use scikit-learn or pandas, show that you care about reproducibility.
- LLM-as-a-Judge vs. Deterministic Metrics: Decide where you need statistical libraries and where an LLM might be needed to “judge” the nuance of sarcasm or complex medical sentiment.
- Confidence vs. Correctness: Code a check for “Confidently Incorrect” predictions. These are your highest-risk errors.
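The “Confidently Incorrect” check is a one-line filter once predictions and confidences sit side by side. The records and the 0.9 threshold below are illustrative assumptions; tune the cutoff to your risk tolerance:

```python
# Synthetic records -- field names and threshold are assumptions
records = [
    {"label": "negative", "prediction": "positive", "confidence": 0.95},
    {"label": "negative", "prediction": "negative", "confidence": 0.80},
    {"label": "positive", "prediction": "negative", "confidence": 0.55},
]

CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff for "high confidence"

# Highest-risk errors: wrong label AND high confidence
confidently_wrong = [
    r for r in records
    if r["prediction"] != r["label"] and r["confidence"] >= CONFIDENCE_THRESHOLD
]
```

In a medical app these are the cases to surface first in the report: the model is not merely wrong, it is wrong while signaling certainty.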
Phase 3: The “Strategist” Phase (Reporting)
- Visual Evidence: Provide Calibration Curves and Confusion Matrices.
- The “Blind Spot” Brief: Structure your findings. Where exactly is the gap? Is the model missing “Negative” reviews because they contain technical medical terms? Explain why the old metrics missed these critical failures.
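Here is a minimal, dependency-free sketch of both artifacts; a real report would plot `sklearn.metrics.confusion_matrix` and `sklearn.calibration.calibration_curve`, and the labels and confidences below are synthetic:

```python
# Synthetic predictions (assumed data, for illustration only)
y_true = ["neg", "neg", "pos", "pos", "pos", "neg"]
y_pred = ["pos", "neg", "pos", "pos", "neg", "pos"]
conf   = [0.95, 0.80, 0.90, 0.60, 0.55, 0.85]

# Confusion matrix as a nested dict: matrix[true_label][predicted_label]
matrix = {t: {p: 0 for p in ("pos", "neg")} for t in ("pos", "neg")}
for t, p in zip(y_true, y_pred):
    matrix[t][p] += 1

# One-bin calibration check: a calibration curve does this per confidence
# bin; here we compare overall mean confidence against overall accuracy.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
mean_confidence = sum(conf) / len(conf)
overconfidence_gap = mean_confidence - accuracy  # > 0 means overconfident
```

A large positive `overconfidence_gap` is exactly the pattern a calibration curve makes visible, and the off-diagonal cells of the matrix show which class the model keeps getting wrong.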
💡 Pro-Tip for Candidates
Employers in ML Eval aren’t looking for a “Data Scientist Lite.” They are looking for Quality & Reliability Engineers. Your GitHub shouldn’t just have a .py file; it should have a README that tells a story of risk and mitigation.


