Elevator Pitch
Learn how to confidently evaluate LLM outputs using statistical acceptance sampling. Discover a robust framework that determines exactly how many outputs to manually review, quantifying LLM quality at your chosen confidence level with minimal effort.
Description
This 30-minute talk demonstrates how acceptance sampling principles can determine LLM output quality with statistical confidence. Learn how to calculate the precise number of samples needed for evaluation based on your risk tolerance, replacing arbitrary sampling with a statistically sound methodology. Designed for professionals deploying LLMs who need to balance quality control with efficiency. You’ll gain practical tools to implement in your AI systems, enabling confident quality assertions while minimizing manual review.
Intro (5 min)
- Introduction to the challenge of assessing LLM quality at scale
- The problem of determining representative sample sizes
- Why we need standardized, scalable processes for LLM evaluation
Explaining the Approach (10 min)
- Framing the problem: why acceptance sampling and hypothesis testing are a good fit
- Explaining the statistics: one-sample hypothesis testing, and how we used the hypergeometric distribution to reduce the large sample sizes otherwise required for small populations
- Wrapping it up into an easy-to-follow decision framework
Python Implementation (10 min)
- Technical stack considerations for implementing acceptance sampling at scale
- Building an MVP calculator that "does the job"
- Practical challenges: choosing the right metrics and confidence levels, and the possibility of gaming the system with large effects
Questions (5 min)
- Open questions
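To give reviewers a feel for the kind of calculation the outline refers to, here is a minimal sketch of a hypergeometric sample-size calculator. It is an illustration under stated assumptions, not the calculator presented in the talk: the function names, the parameters (`max_defect_rate`, `confidence`, `accept_max`) and the accept-on-zero-defects default are all hypothetical choices made for this example.

```python
from math import ceil, comb


def acceptance_probability(population, defectives, sample_size, accept_max):
    """P(at most `accept_max` bad outputs in a sample drawn without replacement).

    This is the hypergeometric CDF, written with the standard library only.
    """
    total = comb(population, sample_size)
    favourable = sum(
        comb(defectives, k) * comb(population - defectives, sample_size - k)
        for k in range(min(accept_max, sample_size) + 1)
    )
    return favourable / total


def required_sample_size(population, max_defect_rate, confidence=0.95, accept_max=0):
    """Smallest sample size that would reject an unacceptable batch.

    Assumption for this sketch: a batch is "unacceptable" if it contains at
    least ceil(max_defect_rate * population) bad outputs, and we accept the
    batch when the reviewed sample shows at most `accept_max` bad outputs.
    """
    # round() guards against floating-point noise before taking the ceiling.
    defectives = ceil(round(max_defect_rate * population, 9))
    alpha = 1 - confidence
    for n in range(1, population + 1):
        if acceptance_probability(population, defectives, n, accept_max) <= alpha:
            return n
    # No smaller sample is enough: review everything.
    return population


# Hypothetical example: 100 LLM outputs, 5% tolerated defect rate, 95% confidence.
print(required_sample_size(100, 0.05))
```

For this example the loop returns 45 outputs to review. The infinite-population approximation, ln(0.05) / ln(0.95), would ask for 59, which is the small-population gap the hypergeometric distribution closes.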
Notes
Why us?
- Two data analysts with hands-on experience working with measurement and data products at GetYourGuide
- Developed and deployed the statistical framework and calculator referenced in the talk
What makes our approach unique?
Unlike other approaches where one LLM audits another LLM, this methodology relies on foundational statistical principles from quality control. This provides interpretable, auditable results that business stakeholders can understand and trust.