The Scale Problem
At 1,000 evals/month, you can manage manually: scripts, notebooks, a person running evals on their laptop. Performance is acceptable, cost is low, complexity is manageable. But at 1M/month, manual management breaks. At 100M/month, you need dedicated infrastructure. This chapter covers the transition from ad hoc eval to systematic, scalable eval orchestration.
The challenge is qualitatively different at each scale tier. At 10K evals/month, you need basic automation. At 100K/month, you need distributed workers. At 1M/month, you need sophisticated job scheduling, failure recovery, and monitoring. At 100M/month, you need horizontal scaling, real-time routing, and cost optimization that rivals production ML infrastructure.
Eval Orchestration Defined
Eval orchestration is the system that coordinates all pieces of the evaluation pipeline. It handles: job scheduling (when should evals run?), worker management (which worker runs which eval?), result aggregation (collecting scattered results into a unified view), anomaly detection (did something go wrong?), storage (where do results live?), and reporting (how do we communicate results?).
The orchestrator is the "operating system" of your eval infrastructure. Just as operating systems manage processes, memory, and I/O, eval orchestrators manage eval jobs, worker capacity, and the results pipeline. A good orchestrator is invisible when working but critical when things fail.
Architecture for Scale
High-Level Pipeline:
Eval Job Queue
↓
Worker Pool (distributed)
↓ ↓ ↓
Worker1 Worker2 Worker3
↓
Result Aggregator
↓
Result Store (S3 or BigQuery)
↓
Monitoring & Alerts
↓
Dashboard & Reports
Key Components:
1. Job Queue: Distributed message queue (Kafka, RabbitMQ, AWS SQS) holds pending eval jobs. Jobs are items (model output, ground truth, evaluation code) waiting to be processed. The queue decouples job submission from evaluation, allowing surge capacity handling.
2. Worker Pool: Stateless workers consume jobs from the queue, execute evaluations, and produce results. Workers are horizontally scalable: add more workers when queue depth increases, remove workers when queue empties. Each worker is idempotent (same job produces same result).
3. Result Aggregator: Collects individual eval results, performs rollup calculations (aggregate metrics, statistical summaries), and outputs to storage. The aggregator handles late-arriving results, retries, and deduplication.
4. Storage Layer: Results go to S3 (for large-scale batch analytics) or BigQuery (for low-latency querying). Design schema for fast queries: partitioned by date and model version, indexed by key dimensions.
5. Monitoring & Alerting: Track: queue depth (is it growing or shrinking?), evaluation latency (how long per eval?), error rate (what fraction fail?), worker health (are workers crashing?). Alert on anomalies.
Practical Implementation: Use a cloud-native stack. AWS: SQS (queue) + Lambda (workers) + S3 (storage) + CloudWatch (monitoring). GCP: Cloud Tasks (queue) + Compute Engine (workers) + BigQuery (storage) + Cloud Monitoring. Azure: Service Bus (queue) + Container Instances (workers) + Data Lake (storage) + Azure Monitor (monitoring).
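The queue → worker pool → aggregator flow can be sketched in a few lines. This is an in-process stand-in, not a real broker: Python's `queue.Queue` plays the role of SQS/Kafka, and the `evaluate` function is a placeholder exact-match check.

```python
import queue
import threading

def evaluate(job):
    # Placeholder eval: exact match between model output and ground truth.
    return {"job_id": job["job_id"],
            "score": 1.0 if job["output"] == job["truth"] else 0.0}

def worker(jobs, results):
    # Stateless worker: consume jobs until it receives a poison pill (None).
    while True:
        job = jobs.get()
        if job is None:
            return
        results.put(evaluate(job))

def run_pipeline(job_list, n_workers=3):
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for job in job_list:
        jobs.put(job)
    for _ in threads:            # one poison pill per worker
        jobs.put(None)
    for t in threads:
        t.join()
    # Aggregator: roll individual results up into a summary metric.
    scores = [results.get() for _ in range(results.qsize())]
    return sum(r["score"] for r in scores) / len(scores)
```

Because workers only pull from the queue and push results, scaling out is just raising `n_workers` (or, in a real deployment, adding pods).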
LLM Judge at Scale
Cost Management: LLM judge costs scale linearly with volume. At 1M evals/month: 1M × $0.10 per eval = $100K in LLM API costs alone. Strategies to reduce costs:
- Model Selection: Cheaper models (GPT-3.5, Claude Haiku) for simpler eval tasks. Expensive models (GPT-4, Claude Opus) for complex eval only when necessary. Route by task complexity.
- Prompt Caching: If evaluating many outputs with the same rubric or context, use prompt caching. Providers discount cached input tokens heavily: Anthropic charges roughly 10% of the normal input price for cache reads, and OpenAI discounts cached tokens by about 50%. A long shared rubric then costs a fraction of its nominal price on every call after the first.
- Batching: Batch multiple evals into a single API call. Instead of 1M API calls, make 10K calls with 100 items each. Reduces latency, improves throughput.
- Local Models: Run open-source models (Llama, Mistral) locally on GPU. Eliminates per-token costs. Initial infrastructure investment but pays off at scale (break-even at ~500K evals).
Rate Limit Management: API providers rate-limit to prevent abuse. At scale, you'll hit rate limits. Strategies: exponential backoff with jitter (don't retry too aggressively), queue management (maintain request pacing), multiple API keys (spread load), dedicated capacity options (buy reserved capacity from providers).
Failover & Retry: If an LLM API call fails: retry with exponential backoff, try a fallback model, and mark the job as failed only after all retries are exhausted. Track failures separately from successful evals; don't average failures into results.
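A hedged sketch of that retry policy: exponential backoff with jitter, then a fallback model, then a terminal "failed" record. `call_judge` and the model names are hypothetical stand-ins for a real LLM API client.

```python
import random
import time

def retry_eval(call_judge, item, models=("cheap-judge", "fallback-judge"),
               max_retries=3, base_delay=0.01):
    for model in models:
        for attempt in range(max_retries):
            try:
                return {"status": "ok", "model": model,
                        "score": call_judge(model, item)}
            except Exception:
                # Exponential backoff with jitter: 1x, 2x, 4x base delay,
                # randomized so retries from many workers don't synchronize.
                time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    # All retries on all models exhausted: record the failure, never a score.
    return {"status": "failed", "model": None, "score": None}
```

Returning an explicit `"failed"` status (rather than a zero score) keeps failures out of the averages, per the point above.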
Human Eval at Scale
Annotation Management at Volume: At 1M evals/month, even a small human-eval percentage (5%) means 50K annotations. Managing this at scale requires systems for: task distribution, worker communication, quality control, and payment.
Worker Pool Management: Recruit 100+ qualified annotators. Segment by specialization (some annotators are experts in code, others in language). Track per-worker quality, turnaround time, and availability. Remove underperformers and reward high performers.
Quality Control Sampling: Reviewing all 50K annotations is impractical. Instead, sample 5-10% for quality review by senior annotators and calculate inter-rater agreement on the sample. If agreement is low, investigate (rubric issues? worker training issues?) and retrain.
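One common agreement statistic for the QC sample is Cohen's kappa between a worker's labels and a senior reviewer's labels (binary pass/fail labels assumed here). Kappa corrects raw agreement for the agreement expected by chance; values below roughly 0.6 are usually read as weak agreement worth investigating.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label
    # if each sampled from their own marginal label distribution.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```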
Annotation Velocity Tracking: Monitor: annotations per day per worker, days to completion for batches. If velocity drops, investigate (are workers burned out? Is task complexity increasing?). Use velocity forecasts to estimate turnaround times for upcoming batches.
SLA Management: Commit to turnaround times ("annotations within 48 hours of request"). Track achievement rate. When SLAs are at risk, escalate to senior workers or increase compensation to attract more workers.
Parallelism and Efficiency
Batching Strategies: Evaluate items in batches, not individually. Batch size determines parallelism: batch of 1 = sequential, batch of 1000 = massively parallel. Trade-offs: larger batches improve throughput but increase memory usage and latency for individual items.
Async vs. Sync Evaluation: Sync evals block until complete (slow, but simple). Async evals return immediately and provide results later (fast, but complex). At scale, use async.
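The async pattern, sketched with `asyncio`: submit a batch concurrently but cap in-flight requests with a semaphore, which also gives you the request pacing that helps with rate limits. `judge` is a hypothetical async stand-in for an LLM API call.

```python
import asyncio

async def eval_batch(items, judge, max_concurrency=10):
    sem = asyncio.Semaphore(max_concurrency)

    async def one(item):
        async with sem:          # at most `max_concurrency` calls in flight
            return await judge(item)

    # gather preserves input order, so results line up with items.
    return await asyncio.gather(*(one(i) for i in items))
```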
GPU Utilization: If running local eval models on GPUs, maximize utilization. Batch aggressively, monitor GPU memory and compute. Underutilized GPUs are expensive; aim for >70% utilization.
Caching Repeated Evaluations: If the same output is evaluated multiple times (common in experimentation), cache results and reuse. Saves 90%+ of eval time on repeated items.
Incremental Evaluation: When only some outputs changed, re-evaluate only the changed items. Full re-evaluation of a 1M-item dataset is expensive; incremental eval might cost 1% of that. Requires tracking which outputs were already evaluated.
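Caching and incremental evaluation can share one mechanism: key each (output, rubric) pair by a content hash and skip anything already scored. A minimal sketch, where `evaluate` stands in for the real (expensive) judge and `cache` is a plain dict rather than a real store like Redis:

```python
import hashlib
import json

def content_key(output, rubric):
    # Deterministic hash of the eval inputs; identical inputs -> same key.
    payload = json.dumps({"output": output, "rubric": rubric}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def evaluate_incremental(items, rubric, evaluate, cache):
    """Score `items`, reusing `cache` (dict keyed by content hash)."""
    results, n_evaluated = [], 0
    for output in items:
        key = content_key(output, rubric)
        if key not in cache:
            cache[key] = evaluate(output, rubric)  # only new/changed items
            n_evaluated += 1
        results.append(cache[key])
    return results, n_evaluated
```

On a re-run where only a few outputs changed, `n_evaluated` is the size of the diff, not the dataset.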
Result Aggregation and Storage
Schema Design for Scale: Design your schema with analytics in mind. Partition by: date (enables range queries), model version (enables comparing versions), task type (enables slicing). Use Parquet format for compression and query performance.
Time-Series vs. Snapshot Storage: Time-series: store every single eval result (100M rows/month). Enables fine-grained analysis but expensive to store and query. Snapshot: aggregate daily (30 rows/month). Cheaper, but loses granularity. Hybrid: time-series for last 90 days, snapshots for historical data.
Partitioning Strategies: Partition by date first (enables efficient time-range queries). Then partition by model version (enables fast "which model is better" comparisons). Use Hive partitioning or equivalent.
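Concretely, Hive-style partitioning encodes the partition keys in the storage path, date first. The bucket name and layout below are illustrative, not a prescribed convention:

```python
from datetime import date

def partition_path(run_date, model_version, task_type, bucket="eval-results"):
    # date=... first enables efficient time-range pruning; model_version
    # and task_type enable fast comparisons and slicing.
    return (f"s3://{bucket}/"
            f"date={run_date.isoformat()}/"
            f"model_version={model_version}/"
            f"task_type={task_type}/"
            f"results.parquet")
```

Query engines that understand this layout (Athena, BigQuery external tables, Spark) skip whole partitions instead of scanning every file.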
Data Retention Policies: Not all data is equally valuable. Raw results from 2 years ago are rarely queried. Policy: 1-year full retention (searchable, raw), 3-year aggregated retention (only summaries), 7-year archive (for compliance). Delete aggressively after retention window.
Monitoring Your Eval System
System Health Metrics:
- Queue Depth: Number of pending jobs. A growing queue means workers can't keep up. Healthy: less than one hour of backlog.
- Evaluation Latency: Time from job submission to result. p50 and p99 matter. p50 >1 second or p99 >10 seconds suggests bottleneck.
- Error Rate: Fraction of jobs that fail. Healthy <1%. >5% suggests underlying problem (bad data? buggy eval code? API issues?).
- Rater Availability: For human eval: % of slots filled by available raters. <80% means you can't achieve required annotation velocity.
- Cost per Eval: Total spend / number of evals. Track daily. >$0.10 per eval (for simple tasks) suggests optimization opportunity.
Alerting Thresholds: Set alerts for: queue depth increasing for >1 hour, error rate > 5%, latency p99 > 30 seconds, daily cost > budget, rater availability < 60%. These suggest action is needed.
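In production these would be CloudWatch or Prometheus alert rules; as a sketch, the thresholds above reduce to a pure function over a metrics snapshot (field names here are illustrative):

```python
THRESHOLDS = {
    "error_rate": 0.05,          # fraction of failed jobs
    "latency_p99_s": 30.0,       # seconds
    "rater_availability": 0.60,  # minimum; alert when below
}

def check_alerts(metrics):
    alerts = []
    if metrics["error_rate"] > THRESHOLDS["error_rate"]:
        alerts.append("error_rate")
    if metrics["latency_p99_s"] > THRESHOLDS["latency_p99_s"]:
        alerts.append("latency_p99_s")
    if metrics["rater_availability"] < THRESHOLDS["rater_availability"]:
        alerts.append("rater_availability")
    if metrics["daily_cost"] > metrics["daily_budget"]:
        alerts.append("daily_cost")
    if metrics["queue_growth_minutes"] > 60:   # depth rising for >1 hour
        alerts.append("queue_depth")
    return alerts
```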
Multi-Model Evaluation Routing
At scale, you evaluate many models in parallel. Route evaluations efficiently: cheaper models for easy tasks, expensive models for hard tasks. Route based on: task complexity (detected from description or history), confidence needed (high-stakes evals need better judges), model type (some models are better at code, others at text).
Confidence-Based Routing: Get initial evaluation from a cheap model. If confidence is high (>90%), send to results. If confidence is medium, escalate to better model. If confidence is low, send to human review. This "triage" pattern reduces cost while maintaining quality.
Dynamic Routing: Track per-model accuracy. If a cheap model's accuracy drops below threshold, increase routing to better models. If accuracy is high, increase cheap model usage. Adapt routing dynamically based on observed performance.
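The triage pattern can be sketched as a small routing function. Tier names, thresholds, and the `(score, confidence)` judge interface are all assumptions for illustration:

```python
def triage_eval(item, judges, hi=0.90, lo=0.60):
    """Route an eval through cheap -> strong -> human by confidence.

    `judges` maps tier name -> callable returning (score, confidence).
    """
    score, conf = judges["cheap"](item)
    if conf >= hi:
        return {"score": score, "route": "cheap"}
    score, conf = judges["strong"](item)
    if conf >= lo:
        return {"score": score, "route": "strong"}
    return {"score": None, "route": "human"}   # queue for human review
```

Dynamic routing then amounts to adjusting `hi` and `lo` (or the tier assignments) as observed per-model accuracy drifts.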
Case Study: Eval Infrastructure at Scale
Fictional Large AI Lab: ACME AI Research — 50M evaluations/month, 12 model variants, 200+ human raters, 99.9% uptime requirement.
Volume Breakdown: 40M automated evals (80%), 10M human evals (20%). Models evaluated: 8 internal variants + 4 benchmarks. Each eval tests multiple dimensions (correctness, quality, safety).
Infrastructure Decisions: (1) Job queue: Kafka (handles the 100K+ messages/sec required). (2) Workers: Kubernetes cluster (400 pods, auto-scaling between 200 and 1,000 based on queue depth). (3) LLM judging: 60% cheap model (GPT-3.5), 30% medium model (Claude), 10% expert model (GPT-4), routed by task complexity. (4) Human evals: 200 specialized raters distributed globally. (5) Storage: BigQuery for results + S3 for raw outputs. (6) Monitoring: Prometheus for infrastructure, custom dashboard for eval metrics.
Cost Breakdown: LLM API: $400K/month. Infrastructure (k8s, storage): $200K/month. Annotation: $500K/month (200 raters × $2,500/month). Total: ~$1.1M/month.
Optimization Realized: Initial naive approach: all evals on GPT-4. Cost: $5M/month. Optimized approach: tiered models + routing. Cost: $1.1M/month. Savings: 78%.
Framework Summary
- Eval Orchestration: The system managing job scheduling, worker management, result aggregation, and monitoring.
- Architecture: Job queue → worker pool → result aggregator → storage → monitoring/dashboard.
- LLM Judge Scaling: Model selection by task, prompt caching, batching, local models, rate limit management.
- Human Eval Scaling: Worker pool management, quality control sampling, velocity tracking, SLA management.
- Parallelism: Batching, async evaluation, GPU utilization, caching, incremental evaluation.
- Storage: Schema design for analytics, partitioning, time-series vs. snapshot tradeoffs, retention policies.
- Monitoring: Queue depth, latency, error rate, rater availability, cost per eval.
- Routing: Route by task complexity, model type, confidence. Adapt dynamically based on performance.
- Case Study: 50M evals/month achievable with 78% cost savings through intelligent tiering and routing.
Build Your Eval Infrastructure
Start with a basic Kafka queue + workers. Add a storage layer. Add monitoring. As you scale (>10M evals/month), optimize: tiered models, prompt caching, incremental eval. The infrastructure should be mostly invisible, but when something breaks, good monitoring should reveal the problem immediately.