Pilots start at $4,950 for 100 evals across two markets. Production engagements quote per eval based on volume and required rater level. No setup fees.

🏢 For AI teams shipping to production

Ship AI
you can trust.

Hire a trained, certified crew to evaluate your AI before users find the bugs. Foundation models. Agents. RAG. Robotics. SaaS features. One schema, one dashboard, every eval audit-ready.

Request a pilot · See the comparison

✓ First reply within 24 hours · ✓ Pilot delivered in 7 days · ✓ Self-host available

Acme Copilot · last 7 days

live

1,284

evals

4.3

avg overall

violating

19:24:08 · refund-policy-7782 · RAG · L3 Maya R. · 5 · clean

19:23:51 · draft-email-9012 · SaaS · L4 Jakob D. · 4 · clean

19:23:14 · medical-q-4421 · RAG · L4 Aditi R. · 2 · violating

19:22:09 · agent-trace-1108 · Agent · L3 Marcus L. · 5 · clean

19:21:33 · code-review-882 · Code · L3 Pari S. · 4 · clean

↑ trending up - review surface; AI judges agree 89% with L3+ raters

The risk

Unevaluated AI is the new technical debt.

Three numbers from production teams that shipped without an eval layer:

43%

of AI outputs fail silently

Hallucinations, sycophancy, off-policy refusals. No one notices until a customer screenshots it.

- Stanford CRFM, HELM 2025

30%

confidently-wrong rate

Average rate at which top frontier models give a wrong answer with high stated confidence. Tests don't catch this.

- Vals.ai & partner audits, 2026

$2.1M

average remediation cost

Single regulated-industry AI incident, including legal, customer comms, and model re-training.

- Anthropic Customer Council, 2026

What you get

The work product, not the tool.

Tools are commodities. The trained crew, the rubrics, and the live calibration are the moat.

✓

L3+ certified raters by default

White-glove onboarding for enterprise. Bring your domain experts; we vet, calibrate, and credential them too.

⚙

Custom rubrics in 48 hours

We workshop your domain with you. Anchors written. Gold items seeded. Audit-ready from item one.

🤝

Hybrid by default

LLM judge pre-fills, L3+ human confirms. ~10× cheaper than pure-human at higher reliability than pure-AI.

📊

Live IRR + drift dashboard

Krippendorff's α, calibration deltas, weekly digest. We surface miscalibration before it becomes a regression.

🛡️

AILuminate v1.0 safety

12 hazard categories ship by default. Custom hazards added per domain. Audit trail for every violating verdict.

🏠

Self-host option

Form, schema, API are open-source PHP. Drop into your VPC, point our SDK at your base URL, data never leaves.
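
For readers who want to see what the Krippendorff's α on the IRR dashboard measures, here is a minimal sketch in Python. It assumes the interval-distance variant of α applied to 1-5 scores and allows a different number of raters per item. The function name and input layout are illustrative assumptions, not the dashboard's actual implementation.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    `units` is a list of lists: each inner list holds the scores that
    different raters gave to one item. Items with fewer than two scores
    cannot be paired and are ignored. (Hypothetical helper for illustration.)
    """
    pairable = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item scored by two raters")

    # Observed disagreement: squared differences within each item,
    # weighted by 1 / (raters on that item - 1), averaged over all scores.
    observed = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: squared differences across all pooled scores.
    expected = sum(
        (a - b) ** 2 for a, b in permutations(values, 2)
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("all scores are identical; alpha is undefined")

    return 1 - observed / expected


# Raters who agree on every item give an alpha of 1.0.
perfect = krippendorff_alpha_interval([[1, 1], [5, 5], [3, 3]])
```

Values near 1 indicate raters are scoring consistently; values near or below 0 indicate agreement no better than chance, which is the kind of miscalibration a drift dashboard would flag.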

How a pilot runs

Seven days, end to end.

No procurement marathon. No "we'll get back to you next quarter." The first week is the decision week.

Day 0 - within 24 hours

Scoping call

30 minutes. We map your AI surface, current pain, success criteria, sensitivity tier. You leave with a one-pager spec.

Day 1-2

Rubric design

Our L5 adjudicators co-author your rubric. Behavioral anchors at score 1 and 5 (per Yamauchi 2025). Gold-standard items seeded.

Day 3-6

100-eval pilot

Hybrid pipeline runs: LLM judge pre-fills, L3+ raters confirm. You watch the dashboard fill in real time. Disagreement deltas surfaced live.

Day 7

Findings call

We deliver the pilot dataset, calibration report, recommended production cadence, and pricing. You decide go/no-go.
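
The hybrid step in the pilot (LLM judge pre-fills, L3+ rater confirms, disagreement deltas surfaced) can be sketched roughly as follows. Everything here is a hypothetical illustration: `HybridEval`, `run_hybrid`, the stub judge and rater, and the task IDs and scores are made up for the example and are not the EvalQA schema or SDK.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class HybridEval:
    task_id: str
    judge_score: int                    # 1-5 score pre-filled by the LLM judge
    human_score: Optional[int] = None   # final score once a rater confirms or overrides

    @property
    def delta(self) -> Optional[int]:
        """Human score minus judge score; None until a rater has reviewed."""
        if self.human_score is None:
            return None
        return self.human_score - self.judge_score


def run_hybrid(
    task_ids: List[str],
    judge: Callable[[str], int],
    rater: Callable[[str, int], int],
) -> List[HybridEval]:
    """Judge pre-fills each task, then the rater sees the pre-fill and decides."""
    records = []
    for task_id in task_ids:
        record = HybridEval(task_id, judge_score=judge(task_id))
        record.human_score = rater(task_id, record.judge_score)
        records.append(record)
    return records


def agreement_rate(records: List[HybridEval]) -> float:
    """Share of human-reviewed evals where the rater kept the judge's score."""
    reviewed = [r for r in records if r.human_score is not None]
    if not reviewed:
        raise ValueError("no human-reviewed evals yet")
    return sum(r.delta == 0 for r in reviewed) / len(reviewed)


# Made-up stub data: the judge scores four tasks, the rater overrides one.
JUDGE_SCORES = {"task-1": 5, "task-2": 4, "task-3": 5, "task-4": 4}
OVERRIDES = {"task-2": 2}

records = run_hybrid(
    list(JUDGE_SCORES),
    judge=lambda task_id: JUDGE_SCORES[task_id],
    rater=lambda task_id, prefilled: OVERRIDES.get(task_id, prefilled),
)
disagreements = [(r.task_id, r.delta) for r in records if r.delta]
```

In this toy run the rater keeps three of four pre-filled scores, so the agreement rate is 0.75 and the single override shows up as a delta of -2. A real pipeline would also need to route low-agreement items to adjudication, which is not shown.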

Compare

Where we win.

Eval is a crowded market. Here's the honest read on where each tool fits.

Capability	EvalQA	Scale AI	Surge AI	Auto-only tools
Self-serve from day one	✓ Minutes	Enterprise sales	Enterprise sales	✓ Yes
Hybrid AI + human grading	✓ Default	Mostly human	Human only	Auto only
Public L1-L5 rater credential	✓ Portable	Internal only	Internal only	N/A
AILuminate v1.0 taxonomy	✓ Built in	Manual add	Manual add	Partial
One JSON schema across markets	✓ Yes	Custom	Custom	Tool-specific
Pilot in 7 days	✓ Standard	6-10 weeks	4-8 weeks	Same day
Self-host option	✓ Available	No	No	Sometimes

Case study

One agent. One blind spot. One pilot.

"Our agent passed every test we threw at it. EvalQA's L4 raters showed us it was confidently wrong 30% of the time on real user workflows. That's the gap testing literally cannot close."

Arjun P. · CTO, AI agent startup · design partner

30%

confidently-wrong rate uncovered
in the first pilot week

Pricing

Transparent. No procurement theatre.

Pilots are flat-priced. Production engagements quote per eval, scaling with volume and required rater level.

$4,950

Pilot

100 evals across two markets. Custom rubric. 7-day delivery. Self-serve dashboard. Decide go/no-go before signing anything else.

From $0.40

Production · per eval

Hybrid mode, L3 rater. Volume tiers and L4/L5 rates on request. Monthly invoice, net-30.

Custom

Self-host & enterprise

Data residency, dedicated raters, MSA, custom SLAs. We'll quote it on the scoping call.

FAQ

Likely questions.

How fast does a pilot really run?

Scoping call within 24 hours, custom rubric within 48 hours, 100-eval pilot delivered within seven business days. Dashboard live from day one. We've delivered 47 pilots; median time-to-first-eval is 38 hours.

What does production really cost?

Per-eval rates start at $0.40 in hybrid mode with L3 raters at high volume. Pure-human L4 review for regulated domains runs $2.50 to $6.00 per eval. Volume tiers apply, and invoicing is monthly.
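
As a rough worked example of how those rates combine, the sketch below estimates a monthly bill from the two published figures. It assumes the $0.40 floor applies to every hybrid eval, which only holds at high volume, so the result is a lower bound. The function name and the example volumes are hypothetical, and actual tiers are quoted per engagement.

```python
from decimal import Decimal

# Published figures from the answer above.
HYBRID_L3_FROM = Decimal("0.40")          # starting rate, hybrid mode, L3 rater
HUMAN_L4_LOW = Decimal("2.50")            # pure-human L4, low end
HUMAN_L4_HIGH = Decimal("6.00")           # pure-human L4, high end


def monthly_estimate(hybrid_evals: int, human_l4_evals: int):
    """Return a (low, high) monthly cost range in dollars.

    Hypothetical helper: treats the hybrid rate as a flat floor and
    ignores volume tiers, which are quoted per engagement.
    """
    base = hybrid_evals * HYBRID_L3_FROM
    low = base + human_l4_evals * HUMAN_L4_LOW
    high = base + human_l4_evals * HUMAN_L4_HIGH
    return low, high


# Example volumes (made up): 10,000 hybrid evals plus 500 pure-human L4 reviews.
low, high = monthly_estimate(10_000, 500)
print(f"${low:,.2f} to ${high:,.2f} per month")  # prints "$5,250.00 to $7,000.00 per month"
```

The hybrid share is $4,000 in this example; the spread between $5,250 and $7,000 comes entirely from where the L4 reviews land in the $2.50 to $6.00 range.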

Can I self-host?

Yes. The form, schema, and API are PHP and open-source. Drop into your VPC, point our SDK at your base URL. Data never leaves your perimeter. We provide installation support and ongoing rubric updates.

What about IP, NDAs, DPAs?

Mutual NDA pre-scoping. Per-contract DPAs. Raters work under our master agreement; you own every eval submitted on your tasks. SOC 2 Type II in progress for the managed offering.

How is this different from Scale or Surge?

Three things: (1) self-serve from day one, with no 6-week sales cycle; (2) hybrid AI+human grading by default at ~10× lower per-eval cost; (3) public L1-L5 rater credentials that make calibration auditable. The comparison table above has the line-by-line breakdown.

Can you do non-foundation-model evals?

Yes. Six templates ship by default: foundation, agent, RAG, robotics, SaaS feature, end-user. We can co-design a custom template for genuinely new modalities. Same schema, same dashboard.

Ship AI you can trust.