EvalQA is the evaluation layer for AI agents, AI-powered apps, and knowledge work. It combines trained human raters (the Eval Army), automated metrics, and L1 - L5 certification into a single platform.

How do I join the Eval Army?

Create a profile at eval.qa/demo/aif/rater.html, pick your specialties, and take the L1 calibration exam. Pass at κ ≥ 0.6 against gold raters and you're in. You get paid every Friday at a per-eval rate that scales with your level.

How do businesses hire raters from the Eval Army?

Request a pilot at eval.qa/business. We scope your domain in 24 hours, run a hundred-eval pilot in a week, then move to production with embedded forms or API calls plus continuous calibration.

How fast can I try the eval form?

Under fifteen seconds. Open eval.qa/demo/aif/form.html, pick a template, click one of the five emoji ratings, and submit. The form expands to Standard and Expert tiers only if you flag a concern.

How do I embed the eval form in my SaaS or AI product?

Add one script tag (https://eval.qa/embed.js), drop a target div, and call EvalQA.embed({container, template, taskId, prompt, onSave}). The form runs cross-origin in an iframe and posts back via postMessage.

What templates does the form support?

Six market templates: foundation model, AI agent, RAG / knowledge assistant, robotics / embodied AI, in-app SaaS feature, and end-user feedback. All write to the same JSON Schema so data is comparable across markets.

Get started with EvalQA - Join Eval Army, hire raters, try the form, embed the SDK

PATH 01 🛡️ For raters

Join the Eval Army.

A domain-trained, certified workforce that evaluates AI. Pass the L1 exam this week, start grading next week, get paid every Friday.

Onboard in under 30 minutes - a guided exam, no résumé required.
L1 → L5 progression - public certification with a portable credential.
Pick your specialty: foundation models, agents, RAG, robotics, SaaS, safety, code, multilingual.
Weekly payouts. Training time is compensated.
Work asynchronously, from anywhere, as much or as little as you want.

Create your profile
Name + email + specialties. Two minutes. (rater.html)

Take the L1 calibration exam
Twenty gold-standard items across your specialties. Pass at κ ≥ 0.6 vs. the gold raters and you're in.

Get matched to a contract
Foundation labs, agent startups, robotics shops, SaaS PMs - your specialties decide what's offered.

Eval, get scored, climb
Every eval is calibrated against gold items. Hit L2 in weeks, L3 in months. Adjudicators (L5) set the rubrics.

Get paid Fridays
Per-eval rate scales with your level. Disputes resolved by L5 adjudicators.
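
The κ ≥ 0.6 pass rule from the calibration step can be made concrete with a small worked example. This is a minimal sketch that assumes plain (unweighted) Cohen's kappa between one rater and the gold labels; the page does not say which kappa variant EvalQA uses, and the function names `cohenKappa` and `passesCalibration` are hypothetical.

```javascript
// Assumption: unweighted Cohen's kappa. EvalQA may use a weighted or
// otherwise different variant; the 0.6 threshold is the only sourced value.
const PASS_THRESHOLD = 0.6;

function cohenKappa(rater, gold) {
  if (rater.length !== gold.length || rater.length === 0) {
    throw new Error("label arrays must be non-empty and the same length");
  }
  const n = rater.length;
  const raterCounts = new Map();
  const goldCounts = new Map();
  let agreements = 0;
  for (let i = 0; i < n; i++) {
    if (rater[i] === gold[i]) agreements++;
    raterCounts.set(rater[i], (raterCounts.get(rater[i]) || 0) + 1);
    goldCounts.set(gold[i], (goldCounts.get(gold[i]) || 0) + 1);
  }
  // Observed agreement: share of items where rater and gold match.
  const observed = agreements / n;
  // Chance agreement: product of the two marginal label frequencies.
  let expected = 0;
  for (const [label, count] of raterCounts) {
    expected += (count / n) * ((goldCounts.get(label) || 0) / n);
  }
  if (expected === 1) return 1; // both sides used one identical label throughout
  return (observed - expected) / (1 - expected);
}

function passesCalibration(rater, gold) {
  return cohenKappa(rater, gold) >= PASS_THRESHOLD;
}

// Ten items, one disagreement: observed = 0.9, expected = 0.5, kappa = 0.8.
const gold = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2];
const rater = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2];
console.log(cohenKappa(rater, gold).toFixed(2), passesCalibration(rater, gold));
```

Kappa corrects raw agreement for chance, which is why a rater who matches gold on half the items by guessing scores near 0 rather than 0.5.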

Create your profile · Learn about the Army

Jane Doe · L4 Senior
Specialties: 🧠 Foundation · 📚 RAG · 📦 SaaS
142 evals this month · 4.6 avg overall · 0.81 κ vs gold

PATH 02 🏢 For businesses

Hire trained raters for your AI.

Shipping an LLM-powered feature without an eval layer is shipping blind. We bring the trained crew, the rubrics, the schema, and the dashboard - you bring the AI.

Acme Copilot · last 7 days · live
1,284 evals · 4.3 avg overall · 12 violating
DAILY VIOLATING: ↑ trending up - review surface

Certified L3+ raters by default. White-glove onboarding for enterprise.
Custom rubrics for your domain - we'll workshop them with you.
Hybrid mode: LLM judge pre-fills, human confirms. ~10× cheaper than pure-human at higher reliability than pure-AI.
Live dashboard with calibration deltas, IRR (Krippendorff's α), drift alerts.
Data residency: self-host the form + storage on your own infra.
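
To make the IRR figure on the dashboard less abstract, here is a minimal sketch of Krippendorff's α computed from a coincidence matrix. It assumes the nominal variant (any two different labels count as equally far apart); the page does not say which distance metric EvalQA applies, and an ordinal metric would be a natural alternative for 1 - 5 scores. The function name and input shape are hypothetical.

```javascript
// Assumption: nominal Krippendorff's alpha. `units` is an array where each
// entry holds the labels that different raters gave to the same task.
function krippendorffAlphaNominal(units) {
  const coincidence = new Map(); // JSON "[c,k]" -> weighted pair count
  const totals = new Map();      // label -> number of pairable ratings
  let n = 0;
  for (const ratings of units) {
    const m = ratings.length;
    if (m < 2) continue; // a single rating carries no agreement information
    for (let i = 0; i < m; i++) {
      for (let j = 0; j < m; j++) {
        if (i === j) continue;
        const key = JSON.stringify([ratings[i], ratings[j]]);
        coincidence.set(key, (coincidence.get(key) || 0) + 1 / (m - 1));
      }
      totals.set(ratings[i], (totals.get(ratings[i]) || 0) + 1);
      n += 1;
    }
  }
  // Observed disagreement: coincidences between two different labels.
  let observed = 0;
  for (const [key, weight] of coincidence) {
    const [c, k] = JSON.parse(key);
    if (c !== k) observed += weight;
  }
  // Expected disagreement: products of marginal totals for different labels.
  let expected = 0;
  for (const [c, nc] of totals) {
    for (const [k, nk] of totals) {
      if (c !== k) expected += nc * nk;
    }
  }
  if (expected === 0) return NaN; // only one label in use: alpha is undefined
  return 1 - ((n - 1) * observed) / expected;
}

// Ten tasks, two raters, binary labels, four disagreements.
const units = [[0, 1], [1, 1], [0, 1], [0, 0], [0, 0], [0, 1], [0, 0], [0, 0], [1, 0], [0, 0]];
console.log(krippendorffAlphaNominal(units).toFixed(3)); // 0.095
```

Unlike pairwise kappa, this form handles any number of raters per task and simply skips tasks that only one rater scored.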

Tell us what you ship
Foundation model, agent, RAG, robotics policy, SaaS feature - pick a template, or we'll design a custom one.

Scoping call (24 hours)
Volume, domains, sensitivity, SLAs. We turn it into a spec the same day.

Pilot - your AI, our raters
Hundred-eval pilot in a week. You see the data, the deltas, the dashboard.

Production
Embed the form, hit the API, or pipe trace logs. Continuous calibration, weekly digest.

Request a pilot · Read the vision · First reply within 24 hours.

PATH 03 🧪 Try it now

Fill an eval form. Ten seconds flat.

No signup. Open the form, click an emoji, you're done. The form expands only if you flag something - gold-standard progressive disclosure backed by NN/G cognitive-load research.

Six market templates: foundation, agent, RAG, robotics, SaaS, end-user.
Quick (10s) → Standard (60s) → Expert (3min). You choose how deep to go.
AILuminate v1.0 safety taxonomy ships by default - 12 hazard categories.
Rationale required only when score ≤ 3. Top scores stay one-click.
Hybrid mode: an LLM pre-fills, you confirm or override.
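
The rationale rule in the list above (required only when score ≤ 3) is easy to express as a client-side check. The sketch below is illustrative only: the field names `score` and `rationale` and the function `validateQuickEval` are assumptions, not the published EvalQA schema.

```javascript
// Sourced rule: a rationale is required only when the score is 3 or lower.
// Assumption: scores are integers 1-5, matching the five emoji ratings.
const RATIONALE_THRESHOLD = 3;

function validateQuickEval({ score, rationale = "" }) {
  const errors = [];
  if (!Number.isInteger(score) || score < 1 || score > 5) {
    errors.push("score must be an integer from 1 to 5");
  } else if (score <= RATIONALE_THRESHOLD && rationale.trim() === "") {
    errors.push("rationale is required when score is 3 or lower");
  }
  return { ok: errors.length === 0, errors };
}

console.log(validateQuickEval({ score: 5 }).ok);                         // true
console.log(validateQuickEval({ score: 2 }).ok);                         // false
console.log(validateQuickEval({ score: 2, rationale: "Wrong date" }).ok); // true
```

Keeping top scores free of required fields is what lets the Quick tier stay one-click.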

Pick a template
Foundation, agent, RAG, robotics, SaaS, end-user, or universal.

Rate the response - one emoji
😡 😕 😐 🙂 😍. Optional one-line note. Submit. You're done if you want to be.

Go deeper if it matters
Standard tier reveals 4 dimensions with behavioral anchors. Expert tier opens AILuminate safety + modality details.

See it land on the dashboard
Real-time, with calibration deltas if multiple raters scored the same task.

Try the foundation template · All six templates

How did the model do?

One read. Pick what stood out.

😡 Terrible · 😕 Weak · 😐 OK · 🙂 Good · 😍 Great

Press 1 - 5 · Tab for next field

Helpful · Accurate · Safe · On-brief

PATH 04 ⚙️ For developers

Install the adapter. Ship in a sprint.

One script tag, one element, one callback. EvalQA renders inside your product via iframe + postMessage. It uses the same schema as the standalone form, so the same dashboard surfaces every eval, no matter who filled it in.

your-product / review-pane.html

<!-- 1. Load the SDK -->
<script src="https://eval.qa/embed.js"></script>

<!-- 2. Drop a target div -->
<div id="eval-here"></div>

<!-- 3. Render -->
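<script>
  // `user` and `currentUser` are your app's own objects, not part of the SDK.
  EvalQA.embed({
    container: "#eval-here",
    template:  "saas",
    taskId:    "draft-9012",
    prompt:    user.lastMessage,
    raterId:   currentUser.email,
    // Hand the saved eval_id to your own backend.
    onSave: ({ eval_id }) => {
      fetch("/api/drafts/9012/eval", {
        method:  "POST",
        headers: { "Content-Type": "application/json" },
        body:    JSON.stringify({ eval_id })
      });
    }
  });
</script>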

One JS file. No npm, no bundler, no build step - works inside any product.
iframe-isolated. Your DOM, cookies, and CSP stay yours.
postMessage protocol: eval:saved, eval:resize, eval:close.
Same six templates, including enduser for one-click production feedback.
Drop-in for LLM judges too: EvalQA.postEval(payload) skips the UI.
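
If you ever handle the postMessage protocol yourself rather than relying on `onSave`, a host-side handler might look like the sketch below. Only the three event names (`eval:saved`, `eval:resize`, `eval:close`) come from this page; the payload fields (`type`, `eval_id`, `height`) and the `handleEvalMessage` helper are assumptions for illustration, so check the integration docs for the real message shape.

```javascript
// Only accept messages from the EvalQA origin; anything else is ignored.
const EVALQA_ORIGIN = "https://eval.qa";

// Returns true when the message was recognized and handled.
// Assumed payload shape: { type, eval_id?, height? } (not confirmed by the docs).
function handleEvalMessage(event, { iframe, onSaved, onClose }) {
  if (event.origin !== EVALQA_ORIGIN) return false;
  const msg = event.data;
  if (!msg || typeof msg.type !== "string") return false;
  switch (msg.type) {
    case "eval:saved":
      onSaved(msg.eval_id);
      return true;
    case "eval:resize":
      iframe.style.height = `${msg.height}px`;
      return true;
    case "eval:close":
      onClose();
      return true;
    default:
      return false;
  }
}

// Browser wiring (hypothetical selector for the iframe embed.js creates):
// const iframe = document.querySelector("#eval-here iframe");
// window.addEventListener("message", (event) =>
//   handleEvalMessage(event, {
//     iframe,
//     onSaved: (id) => console.log("saved", id),
//     onClose: () => iframe.remove(),
//   })
// );
```

The origin check matters because any window can post messages to your page; comparing `event.origin` against a fixed string keeps other frames from spoofing an `eval:saved` event.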

Add the script tag
One line, anywhere in the page - head or before </body>.

Drop a target div
Anywhere you want the form rendered. Auto-sizes to content.

Call EvalQA.embed(...)
Template + task ID + prompt + optional rater. That's it.

Wire onSave
Persist the eval_id in your DB, fire a webhook, gate a deploy - whatever your flow needs.
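
For the "Wire onSave" step, one way to avoid hardcoding the draft ID in the callback is a small factory. This is a sketch under assumptions: `/api/drafts/<id>/eval` is your own backend route (as in the snippet above), and `buildSaveRequest` and `makeOnSave` are hypothetical helper names, not part of the EvalQA SDK.

```javascript
// Build the request for your own backend; nothing here calls EvalQA.
function buildSaveRequest(draftId, evalId) {
  return {
    url: `/api/drafts/${encodeURIComponent(draftId)}/eval`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ eval_id: evalId }),
    },
  };
}

// Returns a callback suitable for the `onSave` option.
// `fetchImpl` is injectable so the callback can be exercised without a network.
function makeOnSave(draftId, fetchImpl = globalThis.fetch) {
  return async ({ eval_id }) => {
    const { url, init } = buildSaveRequest(draftId, eval_id);
    const res = await fetchImpl(url, init);
    if (!res.ok) {
      throw new Error(`saving eval ${eval_id} failed with status ${res.status}`);
    }
    return eval_id;
  };
}

// Usage inside the embed call: onSave: makeOnSave("9012")
console.log(buildSaveRequest("9012", "ev_1").url); // /api/drafts/9012/eval
```

Checking `res.ok` surfaces backend failures that a bare `fetch` call would silently ignore, which matters if a saved eval is meant to gate a deploy.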

Open the integration docs · View embed.js · ~5 KB minified.

Four paths in.
Pick yours.

Join the Eval Army.

Hire trained raters for your AI.

Fill an eval form. Ten seconds flat.

Install the adapter. Ship in a sprint.

Stop shipping blind.
