Now accepting raters, design partners, and embeds

Four paths in.
Pick yours.

EvalQA is the evaluation layer for AI agents, AI apps, and knowledge work. Earn by evaluating AI, hire a trained crew to evaluate yours, try the form, or embed it in your product - same schema, same data, four ways in.

See the four paths · Talk to sales
Self-serve from day one · No credit card to start · Production-ready in days
  1. 🛡️ Join eval.army
    Get certified. Evaluate AI. Earn weekly. Work from anywhere. For raters →
  2. 🏢 Join eval.business
    Hire a trained crew to evaluate your AI before users find the bugs. For teams →
  3. 🧪 Fill an eval form
    Ten seconds to your first rating. Six market templates. See it work. Try it now →
  4. ⚙️ Install the adapter
    One script tag. One element. Eval inside your SaaS or AI product. For developers →
  • 10s to your first eval
  • 6 market templates
  • L1 - L5 rater certifications
  • 1 tag to embed anywhere
PATH 01 🛡️ For raters

Join the Eval Army.

A domain-trained, certified workforce that evaluates AI. Pass the L1 exam this week, start grading next week, get paid every Friday.

  • Onboard in under 30 minutes - a guided exam, no résumé required.
  • L1 → L5 progression - public certification with a portable credential.
  • Pick your specialty: foundation models, agents, RAG, robotics, SaaS, safety, code, multilingual.
  • Weekly payouts. Training time is compensated.
  • Work asynchronously, from anywhere, as much or as little as you want.
  1. Create your profile
    Name, email, specialties. Two minutes. (rater.html)
  2. Take the L1 calibration exam
    Twenty gold-standard items across your specialties. Pass at κ ≥ 0.6 vs. the gold raters and you're in.
  3. Get matched to a contract
    Foundation labs, agent startups, robotics shops, SaaS PMs - your specialties decide what's offered.
  4. Eval, get scored, climb
    Every eval is calibrated against gold items. Hit L2 in weeks, L3 in months. Adjudicators (L5) set the rubrics.
  5. Get paid Fridays
    Per-eval rate scales with your level. Disputes resolved by L5 adjudicators.
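
The κ ≥ 0.6 bar in step 2 can be illustrated with a minimal sketch. It assumes κ means unweighted Cohen's kappa between one candidate and a single set of gold labels; the page does not say which variant EvalQA uses, and `cohensKappa` and `passesL1` are hypothetical names, not part of any EvalQA API.

```javascript
// Unweighted Cohen's kappa between a candidate's labels and the gold labels.
// Both arrays hold categorical scores (for example 1-5) for the same items.
// Assumption: this is an illustration, not EvalQA's actual scoring code.
function cohensKappa(candidate, gold) {
  if (candidate.length !== gold.length || candidate.length === 0) {
    throw new Error("label arrays must be non-empty and the same length");
  }
  const n = candidate.length;
  const candidateCounts = new Map();
  const goldCounts = new Map();
  let agreements = 0;
  for (let i = 0; i < n; i++) {
    if (candidate[i] === gold[i]) agreements++;
    candidateCounts.set(candidate[i], (candidateCounts.get(candidate[i]) || 0) + 1);
    goldCounts.set(gold[i], (goldCounts.get(gold[i]) || 0) + 1);
  }
  const observed = agreements / n;
  // Chance agreement: probability both raters pick the same label independently.
  let expected = 0;
  for (const [label, count] of candidateCounts) {
    expected += (count / n) * ((goldCounts.get(label) || 0) / n);
  }
  if (expected === 1) return 1; // both sides used one identical label throughout
  return (observed - expected) / (1 - expected);
}

const PASS_THRESHOLD = 0.6; // the bar quoted for the L1 exam

function passesL1(candidate, gold) {
  return cohensKappa(candidate, gold) >= PASS_THRESHOLD;
}
```

On an ordinal 1-5 scale a weighted kappa, which penalizes near-misses less than distant misses, may be the better fit; the unweighted form is used here only because it is the simplest to read.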
Sample rater profile: JD · L4 Senior · Specialties: 🧠 Foundation, 📚 RAG, 📦 SaaS · 142 evals this month · 4.6 avg overall · 0.81 κ vs gold
PATH 02 🏢 For businesses

Hire trained raters for your AI.

Shipping an LLM-powered feature without an eval layer is shipping blind. We bring the trained crew, the rubrics, the schema, and the dashboard - you bring the AI.

Sample dashboard: Acme Copilot · last 7 days · live · 1,284 evals · 4.3 avg overall · 12 violating · Daily violating: ↑ trending up - review surface
  • Certified L3+ raters by default. White-glove onboarding for enterprise.
  • Custom rubrics for your domain - we'll workshop them with you.
  • Hybrid mode: LLM judge pre-fills, human confirms. Roughly 10× cheaper than pure-human review, and more reliable than a pure-AI judge.
  • Live dashboard with calibration deltas, IRR (Krippendorff's α), drift alerts.
  • Data residency: self-host the form + storage on your own infra.
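
For readers unfamiliar with the IRR figure on the dashboard, here is a rough sketch of Krippendorff's α. It assumes the nominal variant (every disagreement counts equally) and an input of one array of ratings per task; how EvalQA actually computes α, and with which distance metric, is not stated on this page. `krippendorffAlphaNominal` is a hypothetical name.

```javascript
// Krippendorff's alpha for nominal data.
// `units` is an array of tasks; each task is an array of the ratings it received.
// Tasks with fewer than two ratings cannot be paired and are skipped.
// Assumption: nominal distance (any mismatch counts as 1).
function krippendorffAlphaNominal(units) {
  const valueTotals = new Map(); // how often each rating value appears in pairable tasks
  let n = 0;                     // total number of pairable ratings
  let observedSum = 0;           // weighted count of mismatching ordered pairs
  for (const ratings of units) {
    const m = ratings.length;
    if (m < 2) continue;
    for (let i = 0; i < m; i++) {
      valueTotals.set(ratings[i], (valueTotals.get(ratings[i]) || 0) + 1);
      n++;
      for (let j = 0; j < m; j++) {
        if (i !== j && ratings[i] !== ratings[j]) observedSum += 1 / (m - 1);
      }
    }
  }
  if (n === 0) throw new Error("need at least one task with two or more ratings");
  const observedDisagreement = observedSum / n;
  let sumOfSquares = 0;
  for (const count of valueTotals.values()) sumOfSquares += count * count;
  const expectedDisagreement = (n * n - sumOfSquares) / (n * (n - 1));
  if (expectedDisagreement === 0) return 1; // every rating is the same value
  return 1 - observedDisagreement / expectedDisagreement;
}
```

Unlike Cohen's κ, α handles any number of raters per task and tolerates missing ratings, which is why it suits a pool where different raters see different tasks. For 1-5 scores an ordinal or interval distance would normally replace the nominal one.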
  1. Tell us what you ship
    Foundation model, agent, RAG, robotics policy, SaaS feature - pick a template, or we'll design a custom one.
  2. Scoping call (24 hours)
    Volume, domains, sensitivity, SLAs. We turn it into a spec the same day.
  3. Pilot - your AI, our raters
    Hundred-eval pilot in a week. You see the data, the deltas, the dashboard.
  4. Production
    Embed the form, hit the API, or pipe trace logs. Continuous calibration, weekly digest.
Request a pilot · Read the vision · First reply within 24 hours.
PATH 03 🧪 Try it now

Fill an eval form. Ten seconds flat.

No signup. Open the form, click an emoji, you're done. The form expands only if you flag something - gold-standard progressive disclosure backed by NN/G cognitive-load research.

  • Six market templates: foundation, agent, RAG, robotics, SaaS, end-user.
  • Quick (10s) → Standard (60s) → Expert (3min). You choose how deep to go.
  • AILuminate v1.0 safety taxonomy ships by default - 12 hazard categories.
  • Rationale required only when score ≤ 3. Top scores stay one-click.
  • Hybrid mode: an LLM pre-fills, you confirm or override.
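
Hybrid mode can be pictured as a merge of two inputs. The sketch below assumes per-dimension scores keyed by the four dimensions shown in the form (helpful, accurate, safe, on-brief) and treats any dimension the human leaves untouched as confirmed. The function name, the key names, and the `overrideRate` field are illustrative assumptions, not EvalQA's schema.

```javascript
// Merge an LLM judge's pre-filled scores with a human reviewer's input.
// A dimension the human does not touch (or re-enters unchanged) counts as confirmed.
// Assumption: scores are plain numbers keyed by dimension name.
function resolveHybridEval(prefill, humanOverrides = {}) {
  const scores = {};
  const overridden = [];
  for (const [dimension, llmScore] of Object.entries(prefill)) {
    const humanScore = humanOverrides[dimension];
    if (humanScore !== undefined && humanScore !== llmScore) {
      scores[dimension] = humanScore; // the human wins on disagreement
      overridden.push(dimension);
    } else {
      scores[dimension] = llmScore;   // confirmed as pre-filled
    }
  }
  const total = Object.keys(prefill).length;
  return {
    scores,
    overridden,
    overrideRate: total === 0 ? 0 : overridden.length / total,
  };
}

const result = resolveHybridEval(
  { helpful: 4, accurate: 5, safe: 5, onBrief: 3 }, // LLM pre-fill
  { accurate: 2, safe: 5 }                          // human touched two fields
);
console.log(result.overridden, result.overrideRate);
```

Tracking which dimensions were overridden, rather than only the final scores, gives a running signal of how far the LLM judge can be trusted on each dimension.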
  1. Pick a template
    One of the six market templates (foundation, agent, RAG, robotics, SaaS, end-user), or the universal form.
  2. Rate the response - one emoji
    😡 😕 😐 🙂 😍. Optional one-line note. Submit. You're done if you want to be.
  3. Go deeper if it matters
    Standard tier reveals 4 dimensions with behavioral anchors. Expert tier opens AILuminate safety + modality details.
  4. See it land on the dashboard
    Real-time, with calibration deltas if multiple raters scored the same task.
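
The quick-tier rules above (one emoji, a note that becomes mandatory at 3 or below) can be sketched as a small validator. It assumes the five emoji map to scores 1 through 5 from left to right, matching the "Press 1 - 5" shortcut; `EMOJI_SCORES` and `validateQuickEval` are hypothetical names.

```javascript
// Assumed mapping: emoji in the order shown on the form, scored 1 (worst) to 5 (best).
const EMOJI_SCORES = new Map([
  ["😡", 1],
  ["😕", 2],
  ["😐", 3],
  ["🙂", 4],
  ["😍", 5],
]);
const RATIONALE_THRESHOLD = 3; // a rationale is required when score <= 3

// Validate a quick-tier submission and return a normalized result.
function validateQuickEval({ emoji, note = "" }) {
  const score = EMOJI_SCORES.get(emoji);
  if (score === undefined) {
    return { ok: false, error: "unknown emoji" };
  }
  const trimmedNote = note.trim();
  if (score <= RATIONALE_THRESHOLD && trimmedNote === "") {
    return { ok: false, score, error: "rationale required for scores of 3 or below" };
  }
  return { ok: true, score, note: trimmedNote };
}

console.log(validateQuickEval({ emoji: "😍" }));
console.log(validateQuickEval({ emoji: "😕" }));
console.log(validateQuickEval({ emoji: "😕", note: "Missed the refund policy." }));
```

Keeping the rationale check server-side as well as in the form means evals posted without the UI are held to the same rule.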
Sample form: "How did the model do?" · One read. Pick what stood out. · Press 1 - 5 · Tab for next field · Dimensions: Helpful, Accurate, Safe, On-brief
PATH 04 ⚙️ For developers

Install the adapter. Ship in a sprint.

One script tag, one element, one callback. EvalQA renders inside your product via iframe + postMessage. Same schema as the standalone form - same dashboard surfaces every eval, no matter who filled it.

your-product / review-pane.html
<!-- 1. Load the SDK -->
<script src="https://eval.qa/embed.js"></script>

<!-- 2. Drop a target div -->
<div id="eval-here"></div>

<!-- 3. Render -->
<script>
  EvalQA.embed({
    container: "#eval-here",
    template:  "saas",
    taskId:    "draft-9012",
    prompt:    user.lastMessage,
    raterId:   currentUser.email,
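    onSave: ({ eval_id }) => {
      // Record the eval against the draft in your own backend.
      fetch("/api/drafts/9012/eval", {
        method:  "POST",
        headers: { "Content-Type": "application/json" },
        body:    JSON.stringify({ eval_id })
      });
    }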
  });
</script>
  • One JS file. No npm, no bundler, no build step - works inside any product.
  • iframe-isolated. Your DOM, cookies, and CSP stay yours.
  • postMessage protocol: eval:saved, eval:resize, eval:close.
  • Same six templates, including enduser for one-click production feedback.
  • Drop-in for LLM judges too: EvalQA.postEval(payload) skips the UI.
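
If you want to handle the postMessage events yourself, a host-side listener might look like the sketch below. Only the three event names come from this page. The message shape (`{ type, eval_id, height }`), the `https://eval.qa` origin (inferred from the script URL), and the `routeEvalMessage` helper are assumptions for illustration.

```javascript
// Assumed origin of the embedded iframe, inferred from the embed.js URL.
const EVALQA_ORIGIN = "https://eval.qa";

// Route one message event to the matching handler.
// Returns true when the message was recognized and handled.
function routeEvalMessage(event, handlers) {
  if (event.origin !== EVALQA_ORIGIN) return false; // ignore other frames
  const message = event.data;
  if (!message || typeof message.type !== "string") return false;
  switch (message.type) {
    case "eval:saved":
      handlers.onSaved(message.eval_id);  // assumed field name
      return true;
    case "eval:resize":
      handlers.onResize(message.height);  // assumed field name
      return true;
    case "eval:close":
      handlers.onClose();
      return true;
    default:
      return false;
  }
}

// Browser wiring; skipped when there is no window (for example under Node).
if (typeof window !== "undefined" && typeof document !== "undefined") {
  window.addEventListener("message", (event) => {
    const frame = document.querySelector("#eval-here iframe");
    routeEvalMessage(event, {
      onSaved: (evalId) => console.log("eval saved:", evalId),
      onResize: (height) => { if (frame) frame.style.height = `${height}px`; },
      onClose: () => { if (frame) frame.remove(); },
    });
  });
}
```

Checking `event.origin` before reading `event.data` matters whatever the real payload turns out to be: any frame on the page can post messages to your window.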
  1. Add the script tag
    One line, anywhere in the page - head or before </body>.
  2. Drop a target div
    Anywhere you want the form rendered. Auto-sizes to content.
  3. Call EvalQA.embed(...)
    Template + task ID + prompt + optional rater. That's it.
  4. Wire onSave
    Persist the eval_id in your DB, fire a webhook, gate a deploy - whatever your flow needs.
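
One possible shape for step 4 is an `onSave` callback that reports failures instead of dropping them. The `/api/drafts/:id/eval` route is the host app's own endpoint from the snippet above; `buildEvalSaveRequest` and `makeOnSave` are hypothetical helpers, and the only thing assumed about EvalQA is that `onSave` receives `{ eval_id }`.

```javascript
// Build the request that records an eval against a draft in your own backend.
function buildEvalSaveRequest(draftId, evalId) {
  if (!draftId || !evalId) throw new Error("draftId and evalId are required");
  return {
    url: `/api/drafts/${encodeURIComponent(draftId)}/eval`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ eval_id: evalId }),
    },
  };
}

// Create an onSave callback bound to one draft.
// `fetchImpl` is injectable so the callback can be exercised without a network.
function makeOnSave(draftId, fetchImpl = fetch) {
  return async ({ eval_id }) => {
    const { url, options } = buildEvalSaveRequest(draftId, eval_id);
    const response = await fetchImpl(url, options);
    if (!response.ok) {
      throw new Error(`saving eval failed: HTTP ${response.status}`);
    }
    return response;
  };
}
```

Passing `onSave: makeOnSave("9012")` to `EvalQA.embed` would then replace the inline callback. Deriving the URL from the draft ID avoids hard-coding it next to `taskId`.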

Stop shipping blind.

Pick a path. Start measuring what matters.

Pick a path · Talk to sales