# Eval Anything - Why EvalQA Is the Next-Gen Evaluation Platform

> A research-grade vision document. Foundation models were the warm-up. The real prize is becoming the evaluation layer for **anything that takes an input and produces a judgeable output** - agents, robots, SaaS features, copilots, knowledge workers, physical systems.

---

## TL;DR

The 2024 - 2026 evaluation tooling landscape is large, well-funded, and almost entirely pointed at one thing: **scoring a single LLM response on a single chat turn**. As AI escapes the chat box - into multi-step agents, embodied robots, SaaS features, and human-AI hybrid work - every existing tool finds itself wedged into a corner of the problem.

EvalQA's wedge is a different one. We're not building "yet another LLM-eval framework." We're building **the unified evaluation substrate** for any system, in any market, with a single schema, a single rater contract, and a Trained Eval Army that can be deployed against any modality.

Three structural bets:

1. **Schema-first, market-template UX.** One contract; many forms. The same JSON schema underpins forms for foundation models, agents, RAG, robotics, SaaS features, and end-user thumbs. Each market sees only what's relevant - progressive disclosure done right.
2. **Hybrid by default, not by exception.** LLM-as-judge is *cheap and noisy*. Trained humans are *expensive and reliable*. We pair them: AI pre-fills, humans calibrate. The Eval Army is how we aim for near-human reliability at a fraction of pure-human cost.
3. **Certification as moat.** Every other player tries to win on tooling. We win on **trained, certified evaluators** - L1 to L5 - who pass benchmarks the way auditors pass CPA exams. The form is the front-end; the certification is the durable advantage.

---

## 1. The market today - and why it's about to break

### 1.1 What the field actually built between 2023 and 2026

The current eval stack - by category, with the dominant player(s) - looks like this:

| Layer | Tool / standard | What it's good at | What it doesn't touch |
|---|---|---|---|
| **Benchmark layer** | MT-Bench, MMLU, SWE-Bench Verified, Arena-Hard, AlpacaEval 2, BIGGEN-Bench | Comparing frontier LLMs on shared prompts | Anything in your product; anything beyond text |
| **LLM-as-judge harness** | Inspect AI (UK AISI), OpenAI Evals, DeepEval, LangSmith, Braintrust, Phoenix/Arize, MLflow Evaluate | Programmatic scoring of LLM output against rubrics | Human-in-the-loop at scale; non-LLM systems |
| **Safety / responsible AI** | MLCommons AILuminate v1.0, HELM Safety, NIST AI RMF, EU AI Act | Standardized hazard taxonomy + benchmark | Operationalizing eval inside a product team's day-to-day |
| **Annotation / RLHF** | Label Studio, Argilla, Surge AI, Scale AI Rapid, Prolific | Producing training data, preference pairs | The eval *experience* - they're labeling tools, not eval tools |
| **Production observability** | Galileo, Confident AI, LangSmith, Phoenix | Drift detection, trace analysis post-launch | Pre-launch hill-climbing; non-LLM evals |
| **RAG-specific** | Ragas, TruLens | Per-stage metrics (faithfulness, recall, precision) | Anything that isn't text retrieval |

Every row in that table is built around a single assumption: **the thing being evaluated is an LLM output.** Some tools stretch into agents (Inspect AI's tool-use scoring, LangSmith's trajectories). None of them treats robotics, SaaS features, or human-AI hybrid work as a first-class target.

Anthropic's "Demystifying evals for AI agents" (2026) made the field's blind spot explicit: agents need **transcript + outcome + harness** evaluation, not just output scoring. The same logic generalizes to anything that acts in the world.

### 1.2 The three forces that are about to break the current stack

**Force 1: Agents act, they don't just answer.** Top scores on SWE-Bench Verified jumped from roughly 40% to 80%+ in about a year. τ²-Bench, Terminal-Bench, WebArena, OSWorld - the entire frontier is moving to multi-step agents in real environments. Anthropic's own published vocabulary now distinguishes *transcript* from *outcome*: an agent can "fail" a test because of a stale instruction while actually solving the problem better than asked. Output-only eval breaks here.

**Force 2: AI is escaping the chat box into the physical world.** Figure, 1X, Boston Dynamics, Tesla Optimus, Skild, Physical Intelligence, Covariant, RT-X family - embodied AI is shipping. Eval here means task completion in a physical setting, perception accuracy, control safety, robustness to occlusion, recovery from contact. Zero existing LLM-eval tools generalize. Robotics shops today either roll their own (closed) or use academic benchmarks that don't match deployment.

**Force 3: AI is embedded everywhere in SaaS.** Nearly every B2B SaaS app shipped between 2024 and 2026 added an AI-powered feature - autocomplete, summarization, draft generation, intent classification, in-context suggestions, automated triage. Most PMs run ad-hoc evals in spreadsheets. There is no shared standard for "is this auto-generated email good enough to ship?" The market is huge, fragmented, and underserved.

### 1.3 What Label Studio just shipped - and why we go further

In January 2026, Label Studio published [Building the New Human Evaluation Layer for AI and Agentic Systems](https://labelstud.io/blog/new-evaluation-engine/). Their thesis is correct: human eval has not kept up; modern AI systems generate *traces, tool calls, multimodal outputs, branching state*, not just answers. Their new engine is **programmable, embeddable, multimodal** - the right design tenets.

The gap they leave open:

- **It's still a labeling tool.** Programmable interfaces are great if you're a labeling-platform admin. They're an empty page if you're a product manager who needs to evaluate the AI feature shipping next week.
- **No vertical templates.** Customers must design their own interface for every job.
- **No trained workforce.** The interface is the product; the humans you hire to use it are someone else's problem.
- **No certification.** Two raters with the same UI produce wildly different judgments because no one verified they share a rubric.

EvalQA closes all four gaps by treating the form as a *finished product per market*, the schema as the *durable contract*, and the **Eval Army** as the *workforce layer Label Studio explicitly punts on*.

---

## 2. The thesis - Eval Anything

> *Anything that takes an input and produces a judgeable output is in our addressable market.*

That includes:

- **Foundation models** - chat, instruction following, multi-turn dialog.
- **AI agents** - coding agents, browser agents, research agents, computer-use agents.
- **RAG / knowledge** - enterprise search copilots, customer-support assistants, document Q&A.
- **Robotics & embodied AI** - manipulation policies, navigation stacks, drone autonomy, surgical robotics.
- **SaaS AI features** - every product with autocomplete, summarization, suggested-replies, smart filters, AI-generated content.
- **Knowledge work** - humans plus AI doing claims-processing, code review, design feedback, contract redlining.

The common DNA: **input → process → output → judgement**. The judgement step is the same shape across all of them - what changed, what's good, what's wrong, what's unsafe, what to fix.

That's the form. That's the schema. That's the product.

---

## 3. The design principles - why filling out an EvalQA form has to feel like one tap

Every existing eval tool fails the **time-to-first-rating** test: you arrive on the page, you have to read a manual, configure a project, attach a dataset, build an interface, then finally judge one item. Onboarding is measured in hours.

We're aiming for the opposite extreme: **first rating in under ten seconds.** Six design principles, drawn from the cognitive-load literature and from how Stripe Checkout, Linear's command palette, Apple's pickers, and GOV.UK's "one thing per page" pattern earn their reputations.

### 3.1 Progressive disclosure - three tiers, opt-in deeper

Quick (Tier 0, 10s) → Standard (Tier 1, 60s) → Expert (Tier 2, 3min). Each tier shows only what the previous tier's answers actually require. If the user gave a 😍, no one asks for a rationale. If they gave a 😡, the form *automatically* opens the Standard tier and surfaces concern chips and hazard categories. **The form gets out of the user's way.**
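
The escalation rule above can be sketched in a few lines. This is a minimal illustration, not the shipped form logic: the 1-5 reaction scale, the `ESCALATE_AT_OR_BELOW` cutoff, and the rule that a flagged hazard opens the Expert tier are all assumptions made for the example.

```python
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    QUICK = 0     # ~10 s
    STANDARD = 1  # ~60 s
    EXPERT = 2    # ~3 min


# Assumption: reactions map to a 1-5 scale (1 = 😡, 5 = 😍) and scores at or
# below this cutoff escalate. The document does not fix the cutoff.
ESCALATE_AT_OR_BELOW = 2


def next_tier(current: Tier, reaction: int, hazard_flagged: bool = False) -> Optional[Tier]:
    """Return the tier to open next, or None if the rater is done."""
    if not 1 <= reaction <= 5:
        raise ValueError("reaction must be between 1 and 5")
    if current == Tier.QUICK and reaction <= ESCALATE_AT_OR_BELOW:
        return Tier.STANDARD  # a negative reaction auto-opens Standard
    if current == Tier.STANDARD and hazard_flagged:
        return Tier.EXPERT    # assumed rule: a flagged hazard opens Expert
    return None
```

A happy rater (`next_tier(Tier.QUICK, 5)`) gets `None` and is finished after one tap; an unhappy one is moved to Standard without having to ask for it.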

### 3.2 Anchor only the extremes - research-backed

[Yamauchi, Yano & Oyamada (2025)](https://arxiv.org/html/2506.13639v1) showed that when an LLM judge has anchors for scores 1 and 5 only, human-alignment matches the version with all five anchors. We mirror this in the human UI: short anchors at "Ignored the ask" / "Followed exactly," nothing in between. Less reading, equal signal.
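
As a rough sketch of what "anchor only the extremes" means for the rendered scale, the hypothetical helper below labels the first and last points and leaves the middle points as bare numbers. The function name and label format are illustrative, not part of the schema.

```python
from typing import List


def scale_labels(low_anchor: str, high_anchor: str, points: int = 5) -> List[str]:
    """Build labels for a rating scale with text anchors only at the two ends."""
    if points < 2:
        raise ValueError("a scale needs at least two points")
    labels = [str(i) for i in range(1, points + 1)]
    labels[0] = f"1: {low_anchor}"           # anchor the lowest score
    labels[-1] = f"{points}: {high_anchor}"  # anchor the highest score
    return labels


print(scale_labels("Ignored the ask", "Followed exactly"))
```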

### 3.3 Plain language - no jargon at Tier 0

Tier 0 asks "How did this AI response land?" not "Rate the response on the helpfulness dimension." This follows [NN/G's principle of clarity](https://www.nngroup.com/articles/4-principles-reduce-cognitive-load/): forms should read at a 6th-to-8th grade level. Specialist vocabulary appears only in Tier 2 - and only for users who clicked through.

### 3.4 Conversational ordering

Familiarity → priority → dependency → complexity → sensitivity (the NN/G order). We ask "How did it feel?" before "What modality?" Demographics come last. The form mimics the conversation a senior reviewer has with a junior - not the schema a database expects.

### 3.5 Required rationale only when the score warrants it

A 5 doesn't need explaining. A 3 does. Mandatory free-text on low scores is the single biggest lever for inter-rater reliability - but only when it's *triggered*, not constant. Constant requirements train users to write "n/a" and move on.
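
A triggered requirement is easy to express as a validation rule. The sketch below assumes a 1-5 score, takes the cutoff of 3 from the text above (treating 4 as not requiring a rationale is our assumption), and adds a hypothetical minimum length so that "n/a" does not count as an explanation.

```python
from dataclasses import dataclass
from typing import List

RATIONALE_REQUIRED_AT_OR_BELOW = 3  # from the text: a 3 needs explaining, a 5 does not
MIN_RATIONALE_CHARS = 10            # hypothetical guard against "n/a"


@dataclass
class Rating:
    score: int           # 1-5
    rationale: str = ""


def validation_errors(rating: Rating) -> List[str]:
    """Return a list of problems; an empty list means the rating can be submitted."""
    errors = []
    if not 1 <= rating.score <= 5:
        errors.append("score must be between 1 and 5")
    needs_rationale = rating.score <= RATIONALE_REQUIRED_AT_OR_BELOW
    if needs_rationale and len(rating.rationale.strip()) < MIN_RATIONALE_CHARS:
        errors.append("rationale required for this score")
    return errors
```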

### 3.6 Templates per market, schema common to all

The same JSON schema underlies the foundation-model template, the agent template, the robotics template, and the SaaS feature template. A rater filling out the robotics form never sees fields about token usage. A rater filling out the agent form never sees fields about perceptual quality. **The schema is universal; the surface is per-market.**
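
One way to picture "universal schema, per-market surface" is a single field registry with a visibility set per field. The field names and the market assignments below are hypothetical, chosen only to mirror the two examples in the paragraph above; the real contract lives in `eval-form.schema.json`.

```python
from typing import List

TEMPLATES = {"foundation_model", "agent", "rag", "robotics", "saas", "end_user"}

# Hypothetical registry: field name -> templates that surface it ("all" = every template).
SCHEMA_FIELDS = {
    "overall_score": {"all"},
    "rationale": {"all"},
    "token_usage": {"foundation_model", "agent"},
    "tool_calls": {"agent"},
    "perceptual_quality": {"robotics"},
    "control_safety": {"robotics"},
}


def visible_fields(template: str) -> List[str]:
    """Return the schema fields a rater sees for one market template."""
    if template not in TEMPLATES:
        raise ValueError(f"unknown template: {template}")
    return [
        name
        for name, markets in SCHEMA_FIELDS.items()
        if "all" in markets or template in markets
    ]
```

Every stored record still conforms to the same schema; hidden fields are simply absent or null for that market.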

---

## 4. The Eval Army - the layer no one else has built

Tooling commoditizes. Workforce doesn't.

The other half of the bet is a **trained, certified, persistent evaluator workforce** - the Eval Army, L1 through L5. L1 is "you completed an onboarding eval suite at >0.6 kappa with our gold raters." L5 is "you can adjudicate disagreements across multiple modalities, set rubrics, and supervise others." The credential is portable.

This solves three problems no labeling-platform vendor solves:

1. **Calibration is durable.** When the same human passes the same gold items every cycle, drift is detectable. When the same human moves between contracts (foundation-model eval today, robotics eval next quarter), their shared rubric and mental model go with them.
2. **Hybrid mode actually works.** AI judges are noisy; human raters are expensive. A *certified* human in the loop validates the LLM's draft at a fraction of pure-human cost - but only if the human is reliable. Surge AI has shown this commercially for RLHF; nobody has built the public certification standard that makes this hireable at scale.
3. **The market trusts a name.** "Reviewed by L4-certified EvalQA raters" reads on the box like "USDA Organic" reads on produce. The credential is the moat.
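
The L1 bar described above (agreement with gold raters above 0.6 kappa) can be checked with a few lines of code. The sketch below assumes Cohen's kappa over categorical labels; the document does not say which kappa variant the exam uses, and `passes_l1` is a hypothetical helper name.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def passes_l1(candidate: Sequence[Hashable], gold: Sequence[Hashable], threshold: float = 0.6) -> bool:
    """True if the candidate's agreement with gold labels exceeds the threshold."""
    return cohens_kappa(candidate, gold) > threshold
```

Kappa rather than raw agreement matters here because a rater who marks everything "safe" can reach high raw agreement on a skewed gold set while adding no signal.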

---

## 5. Concrete go-to-market - six markets, six templates, one schema

Each template ships with: tuned headline copy, default modality, the dimensions surfaced in Standard, the chips offered at the Quick tier, and the expert-tier sections that auto-open vs. stay collapsed.

### 5.1 Foundation model (default)

Target: AI labs, model providers, internal LLM platform teams.
Quick headline: *"How did the model do?"*
Standard dims: instruction-following, faithfulness, helpfulness, safety.
Hazards: AILuminate v1.0 - 12 categories, default non-violating.
Pricing wedge: per-eval, hybrid cheaper than pure-human.

### 5.2 AI agent / tool-use

Target: Anthropic-, OpenAI-, and Cursor-stack teams; agent startups; coding-agent vendors.
Quick headline: *"Did the agent complete the job?"*
Standard dims: outcome → instruction → plan → safety.
Expert-tier auto-opens: agent extension (task success, tool calls, plan quality, error recovery, pass@k / pass^k).
Differentiator: transcript-aware. Inspect AI is the framework; we are the *human-grade scoring* on top.
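
For readers unfamiliar with the two reliability metrics named above, here is a small sketch using the standard combinatorial estimators: pass@k (at least one of k sampled trials succeeds) and pass^k (all k sampled trials succeed), both estimated from `n` recorded trials with `c` successes. How EvalQA collects those trials is not specified here; the function names are ours.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one success in k trials) from n trials with c successes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed) from n trials with c successes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# An agent that succeeds on 5 of 10 recorded trials:
print(round(pass_at_k(10, 5, 2), 4))   # chance that at least one of 2 tries works
print(round(pass_hat_k(10, 5, 2), 4))  # chance that both of 2 tries work
```

The gap between the two numbers is the point: pass@k rewards an agent that can succeed eventually, while pass^k rewards one that succeeds every time.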

### 5.3 RAG / knowledge assistant

Target: enterprise search vendors, customer-support copilots, doc-Q&A products, legal/healthcare/finance copilots.
Quick headline: *"Is the answer grounded in the sources?"*
Standard dims: faithfulness → citation accuracy → answer relevance → context precision/recall.
Expert section auto-open: unsupported-claim list with snippet extraction.
Differentiator: Ragas-aligned metrics built in, but human-graded - Ragas-style automated metrics are noisy on edge cases.
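
To make the last Standard dimension concrete, the sketch below computes a plain set-based context precision and recall from human relevance judgments. This is a deliberately simple version for illustration and is not Ragas's own (rank-aware, LLM-assisted) definition; the document IDs are made up.

```python
from typing import List, Set, Tuple


def context_precision_recall(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall of retrieved chunk IDs against rater-marked relevant IDs."""
    retrieved_set = set(retrieved)
    hits = retrieved_set & relevant
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 chunks retrieved, raters marked 3 chunks as relevant.
precision, recall = context_precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"})
```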

### 5.4 Robotics & embodied AI

Target: Figure, 1X, Skild, Physical Intelligence, surgical-robotics vendors, drone autonomy startups.
Quick headline: *"How did the robot do?"*
Standard dims: task completion → perception accuracy → control safety → recovery from contact.
Hazards: shifted toward physical-world risk - collision, unsafe force, payload misuse, exclusion-zone breach. AILuminate's specialized-advice category extends to *physical safety-critical advice*.
Expert section auto-open: hazards (physical-world risk weighs more than abstract chat hazards).
Differentiator: **nobody else does this.** The eval-for-robotics market is wide-open white space.

### 5.5 SaaS / in-app AI feature

Target: product managers at any B2B SaaS shipping an AI feature.
Quick headline: *"Would you ship this answer to a customer?"*
Standard dims: helpfulness → instruction (only).
Expert: hidden by default - most PMs never need it.
Pricing wedge: pay-per-month, embed our form as an `<iframe>` in your internal review tool.
Differentiator: minimal-friction. Eval becomes a sprint ritual, not a research project.

### 5.6 End-user feedback

Target: production AI products that want sub-5-second user feedback.
Quick headline: *"Was this helpful?"*
One-click thumbs. Optional one-line rationale. That's it.
This is the template that **feeds the data flywheel** - the highest-volume, lowest-fidelity signal that calibrates everything else.

---

## 6. Why this works - five durable advantages

1. **Single schema, many UIs.** As markets evolve, we ship a new template, not a new platform. Schema versioning is the only contract we have to maintain.
2. **Human + AI hybrid by default.** Most competitors pick one. We default to both, and surface AI-vs-human deltas as a feature, not a bug.
3. **Certification as a moat.** Tools are copyable. A credentialed workforce that has passed a public benchmark is not.
4. **Embeddability.** The form is a single HTML file that POSTs JSON to a public endpoint. It drops into anyone's CI, anyone's internal admin, anyone's Slack workflow.
5. **Time-to-first-rating < 10 seconds.** Every existing tool requires onboarding. We require a click.
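
Advantage 2 talks about surfacing AI-vs-human deltas. A minimal sketch of that report, assuming both the LLM judge and the human rater score the same items on the same 1-5 scale, might look like this. The `flag_at` threshold and the output shape are illustrative assumptions.

```python
from statistics import mean
from typing import Dict, List, Union


def judge_deltas(
    ai_scores: Dict[str, int],
    human_scores: Dict[str, int],
    flag_at: int = 2,
) -> Dict[str, Union[Dict[str, int], List[str], float]]:
    """Compare AI-judge and human scores on the items both have rated."""
    shared = sorted(ai_scores.keys() & human_scores.keys())
    deltas = {item: ai_scores[item] - human_scores[item] for item in shared}
    # Items where the judge and the human disagree by flag_at points or more.
    flagged = [item for item in shared if abs(deltas[item]) >= flag_at]
    mean_abs = mean(abs(d) for d in deltas.values()) if deltas else 0.0
    return {"deltas": deltas, "flagged": flagged, "mean_abs_delta": mean_abs}
```

Flagged items are natural candidates for adjudication or for new gold items in the certification suite.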

---

## 7. The roadmap

| Phase | Months | What ships |
|---|---|---|
| **0 - Now** | 0 | Demo form (`demo/aif/`), six templates, schema, append-only JSONL store, basic dashboard stub. |
| **1 - Foundation + agents** | 1 - 3 | Production deploy on `eval.qa`. LLM-judge API documented. Inspect AI / OpenAI Evals importers. L1 certification exam live. |
| **2 - RAG + SaaS** | 3 - 6 | RAG template extended to citation-snippet UI. SaaS embed widget (`<iframe>` + JS SDK). PMs at 20 design-partner B2B shops. L2 + L3 certification live. |
| **3 - Robotics + multimodal** | 6 - 12 | Robotics template w/ video-trace + state-check UI. Physical hazards taxonomy. Partnership with 2 - 3 robotics shops. L4 certification live. |
| **4 - Eval Army as a network** | 12+ | Cross-market rater allocation. SLA-bound hybrid eval as a service. Public IRR leaderboard for raters. L5 (adjudicator) certification live. |
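
The Phase 0 "append-only JSONL store" is simple enough to sketch. The version below assumes one JSON object per line and a server-side `received_at` timestamp; the record fields and file name are hypothetical, and the actual demo store may differ.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from typing import Any, Dict, List


def append_rating(path: str, record: Dict[str, Any]) -> None:
    """Append one rating as a single JSON line; existing lines are never rewritten."""
    stamped = {**record, "received_at": datetime.now(timezone.utc).isoformat()}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(stamped, ensure_ascii=False) + "\n")


def read_ratings(path: str) -> List[Dict[str, Any]]:
    """Load every rating, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = os.path.join(tmp, "ratings.jsonl")  # hypothetical file name
        append_rating(store, {"template": "end_user", "score": 1})
        print(read_ratings(store))
```

Append-only storage keeps an audit trail for free: corrections are new lines, not edits, which matters once ratings feed certification decisions.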

---

## 8. The first three things to prove

Before we believe our own thesis, three things must be true. If any of them fails, the platform doesn't work.

1. **Templates actually save time.** Measurable: time-to-first-rating < 15s on the foundation template, < 25s on robotics, < 10s on end-user. Track with the form's own timing instrumentation.
2. **Hybrid mode matches human-only review on safety calls.** If AI pre-fill + L2 human confirm doesn't agree with pure L4 human review at κ ≥ 0.8, the cost-savings narrative collapses.
3. **At least one non-foundation-model market signs a design partner.** A robotics shop, a B2B SaaS, or a knowledge-work team. If after three months we're still 100% foundation-model customers, the "eval anything" thesis is wrong and we should pivot back to LLM-eval.
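
The first criterion is directly measurable from the form's timing data. The sketch below uses the three budgets stated above and assumes the median is the statistic compared against each budget; the list does not say whether the target is a median, a mean, or a percentile.

```python
from statistics import median
from typing import Sequence

# Budgets in seconds, taken from criterion 1 above.
BUDGET_SECONDS = {"foundation_model": 15.0, "robotics": 25.0, "end_user": 10.0}


def within_budget(template: str, timings: Sequence[float]) -> bool:
    """True if the median time-to-first-rating is under the template's budget."""
    if template not in BUDGET_SECONDS:
        raise ValueError(f"no budget defined for template: {template}")
    if not timings:
        raise ValueError("need at least one timing sample")
    return median(timings) < BUDGET_SECONDS[template]
```

Using the median keeps one distracted rater from failing the whole template; a stricter variant could compare a high percentile instead.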

---

## Appendix - primary sources

This document was assembled from the following sources, weighted toward 2025 - 2026 publications.

- Anthropic - [Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)
- MLCommons - [AILuminate v1.0 paper (arXiv:2503.05731)](https://arxiv.org/abs/2503.05731), [Safety Methodology](https://mlcommons.org/ailuminate/safety-methodology/)
- Label Studio - [Building the New Human Evaluation Layer for AI and Agentic Systems](https://labelstud.io/blog/new-evaluation-engine/) (Jan 2026)
- Yamauchi, Yano & Oyamada - [An Empirical Study of LLM-as-a-Judge](https://arxiv.org/html/2506.13639v1) (NEC / U Tokyo, 2025)
- Nielsen Norman Group - [Few Guesses, More Success: 4 Principles to Reduce Cognitive Load in Forms](https://www.nngroup.com/articles/4-principles-reduce-cognitive-load/) (July 2025)
- Hamel Husain & JJ Allaire - [Inspect AI, An OSS Python Library For LLM Evals](https://hamel.dev/notes/llm/evals/inspect.html)
- UK AISI - [Inspect AI documentation](https://inspect.aisi.org.uk/)
- Confident AI - [10 Best AI Evaluation Tools for 2026](https://www.confident-ai.com/knowledge-base/compare/best-ai-evaluation-tools-2026)
- Braintrust - [5 best AI evaluation tools for AI systems in production (2026)](https://www.braintrust.dev/articles/best-ai-evaluation-tools-2026)
- MLflow - [Top 5 Agent Evaluation Tools in 2026](https://mlflow.org/top-5-agent-evaluation-frameworks/)
- Argilla - [argilla-io/argilla on GitHub](https://github.com/argilla-io/argilla)
- Surge AI / Anthropic - [Surge case study](https://surgehq.ai/blog/anthropic-surge-ai-rlhf-platform-train-llm-assistant-human-feedback)
- DeepEval - [confident-ai/deepeval](https://github.com/confident-ai/deepeval)

Internal references inside this repo:
- [`PLAN.md`](./PLAN.md) - the schema-design plan
- [`eval-form.schema.json`](./eval-form.schema.json) - the contract
- [`form.html`](./form.html) - the deployable form