Trust Your LLM: Statistical Acceptance Sampling for Reliable Quality Measurement

By Agus Figueroa

Elevator Pitch

Learn how to confidently evaluate LLM outputs using statistical acceptance sampling. Discover a robust framework that determines exactly how many outputs to manually review, quantifying LLM quality at your chosen confidence level with minimal effort.

Description

This 30-minute talk demonstrates how acceptance sampling principles can determine LLM output quality with statistical confidence. Learn how to calculate the precise number of samples needed for evaluation based on your risk tolerance, replacing arbitrary sampling with a statistically sound methodology. Designed for professionals deploying LLMs who need to balance quality control with efficiency. You’ll gain practical tools to implement in your AI systems, enabling confident quality assertions while minimizing manual review.

Intro (5 min)
  • Introduction to the LLM quality assessment challenge at scale
  • The problem of determining representative sample sizes
  • Why we need standardized, scalable processes for LLM evaluation

Explaining the Approach (10 min)
  • Framing the problem: why acceptance sampling and hypothesis testing are a good fit
  • Explaining the statistics: one-sample hypothesis testing, and how we used the hypergeometric distribution to avoid the large sample sizes otherwise required for small populations
  • Wrapping it up into an easy-to-follow decision framework
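As a rough illustration of the statistics in this part, the acceptance rule can be written with the standard hypergeometric probability. The symbols are assumptions introduced here, not notation from the talk: N is the number of LLM outputs in the batch, D the number of defective outputs at the worst tolerable quality level, n the number of outputs reviewed, c the maximum number of failures allowed in the sample, and γ the chosen confidence level.

```latex
% Probability that a batch with D defective outputs still passes review
% when n outputs are drawn without replacement and at most c may fail:
P_{\text{accept}}(n) \;=\; \sum_{k=0}^{c} \frac{\binom{D}{k}\,\binom{N-D}{\,n-k\,}}{\binom{N}{n}}

% The required sample size is the smallest n that keeps this risk below 1 - \gamma:
n^{*} \;=\; \min\left\{\, n : P_{\text{accept}}(n) \le 1 - \gamma \,\right\}
```

Because sampling is without replacement, this probability falls faster in n than its binomial counterpart, which is why small batches need fewer reviews than a binomial calculation would suggest.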

Python Implementation (10 min)
  • Technical stack considerations for implementing acceptance sampling at scale
  • Building an MVP calculator that “does the job”
  • Practical challenges: choosing the right metrics and confidence levels, and the possibility of gaming the system with large effects
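A minimal sketch of what such a calculator could look like, using only the Python standard library. This is not the calculator presented in the talk: the function names, parameters, and the rule of rounding the tolerated defect count up are all assumptions made for illustration.

```python
from math import ceil, comb


def acceptance_probability(population, defects, sample_size, max_failures=0):
    """Hypergeometric probability of seeing at most `max_failures` defective
    items in a sample drawn without replacement from `population` items,
    of which `defects` are defective."""
    total = comb(population, sample_size)
    favourable = sum(
        comb(defects, k) * comb(population - defects, sample_size - k)
        for k in range(min(max_failures, defects, sample_size) + 1)
    )
    return favourable / total


def required_sample_size(population, max_defect_rate, confidence=0.95,
                         max_failures=0):
    """Smallest number of outputs to review so that a batch at the worst
    tolerable defect rate passes with probability at most 1 - confidence.

    Assumption: the defect count is rounded up, which is the stricter choice.
    """
    defects = ceil(max_defect_rate * population)
    risk = 1 - confidence
    for n in range(1, population + 1):
        if acceptance_probability(population, defects, n, max_failures) <= risk:
            return n
    return population  # fall back to reviewing every output


if __name__ == "__main__":
    # Hypothetical batches: the same quality bar applied to two batch sizes.
    for batch in (200, 100_000):
        n = required_sample_size(batch, max_defect_rate=0.05, confidence=0.95)
        print(f"batch of {batch}: review {n} outputs, accept if none fail")
```

With `max_failures=0` this is a zero-acceptance plan: the batch is accepted only if every reviewed output passes. Raising `max_failures` tolerates a few failures at the cost of a larger sample.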

Questions (5 min)
  • Open questions

Notes

Why us?

  • Two data analysts with hands-on experience working with measurement and data products at GetYourGuide
  • Developed and deployed the statistical framework and calculator referenced in the talk

What makes our approach unique?

Unlike other approaches where one LLM audits another LLM, this methodology relies on foundational statistical principles from quality control. This provides interpretable, auditable results that business stakeholders can understand and trust.