# AIF - AI Foundational-Model Eval Form

Drop-in evaluation form pipeline for the EvalQA platform.

**Three pillars, six market templates, one schema, ten-second start.**

The platform stands on three legs:

1. **Who is rating** - every eval carries an Eval Army rater profile (name, L1 - L5 certification, specialties), so calibration drift and inter-rater reliability are measurable. → [`rater.html`](./rater.html)
2. **How to integrate** - a single-tag JS SDK (`embed.js`) drops the form into any SaaS or AI product via iframe + postMessage. → [`integrate.html`](./integrate.html)
3. **The form itself** - six market templates, progressive 3-tier disclosure, schema-driven. → [`form.html`](./form.html)

> Why this exists, in two paragraphs, then everything else:

The 2026 eval landscape is full of LLM-only frameworks (Inspect AI, OpenAI Evals, MLflow Evaluate, DeepEval, Braintrust, LangSmith, Galileo, Confident AI). They all evaluate the same shape of thing - a model output, judged by a model judge or a labeling-tool annotator. Meanwhile AI has escaped the chat box into agents, robots, SaaS features, and human-AI hybrid work, and the eval tooling hasn't followed.

AIF is the front-end of EvalQA's "eval anything" thesis: **one schema, many market-tuned forms, hybrid human+AI grading by default, certified rater workforce on top.** Foundation models today; robotics, SaaS features, and embodied AI next.

---

## What's in this folder

```
demo/aif/
├── README.md                ← you are here
├── PLAN.md                  ← schema-design research plan (sources cited)
├── VISION.md                ← "Eval Anything" vision doc
├── eval-form.schema.json    ← the JSON Schema - single source of truth
│
├── form.html                ← Pillar 3 - the eval form (six templates)
├── rater.html               ← Pillar 1 - Eval Army profile / sign-in
├── integrate.html           ← Pillar 2 - SDK docs + live embed demo
├── embed.js                 ← Pillar 2 - single-file JS SDK for SaaS embedding
│
├── dashboard.php            ← server-rendered results overview
├── api/
│   ├── save_eval.php        ← POST eval, JSONL append, validates
│   ├── get_evals.php        ← GET evals for dashboard / hybrid prefill
│   ├── save_rater.php       ← POST Eval Army profile
│   ├── get_rater.php        ← GET rater profile / roster
│   └── schema.php           ← serves the schema with proper headers
└── data/
    ├── evals.jsonl          ← append-only eval store
    └── raters.jsonl         ← append-only rater profile store
```

## Quick start

### Pillar 1 - sign in as an Eval Army rater

```
open https://eval.qa/demo/aif/rater.html
```

Fill in your name, email, certification level (L1 - L5), and specialties. The profile is stored in `data/raters.jsonl` and cached in localStorage; every subsequent eval automatically carries your `rater_id`. Anonymous submissions remain allowed but won't contribute to your inter-rater reliability (IRR) track record.

### Pillar 2 - embed in your SaaS / AI product

```html
<script src="https://eval.qa/demo/aif/embed.js"></script>
<div id="eval-here"></div>
<script>
  EvalQA.embed({
    container: "#eval-here",
    template: "saas",
    taskId: "ticket-9012",
    prompt: "Customer asked: ...",
    raterId: "jane@acme.com",
    onSave: ({ eval_id }) => console.log("saved", eval_id)
  });
</script>
```

See [`integrate.html`](./integrate.html) for the live demo, full API reference, postMessage protocol, and recipes for: SaaS in-app review, agent CI gate, hybrid review queue, end-user thumbs feedback.

### Pillar 3 - try the form (human rater)

```
open https://eval.qa/demo/aif/form.html
```

You'll land on the template chooser. Pick "Foundation model" (the default) or any of the other templates. You can submit your first eval in under 15 seconds.

### Submit from an LLM judge (AI auto-fill)

POST a JSON payload conforming to `eval-form.schema.json` directly:

```bash
curl -X POST https://eval.qa/demo/aif/api/save_eval.php \
  -H 'Content-Type: application/json' \
  -H 'X-Rater-Type: llm_judge' \
  -H 'X-Rater-Model: claude-sonnet-4-6' \
  -d @your-eval.json
```

Recommended LLM-judge defaults (from [Yamauchi et al. 2025](https://arxiv.org/html/2506.13639v1)):
- Sample N=5 with mean aggregation (`rater.n_samples: 5`).
- Provide both a reference answer and score anchors (anchors for 1 and 5 only - intermediate anchors don't add signal).
- CoT reasoning is optional when anchors are present.
- Use one judge call per dimension when stakes are high; a single combined call is noisier.
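
The sampling default above can be sketched as follows. This is a minimal illustration, not the platform's implementation: `aggregateJudgeSamples` is a hypothetical helper, the `score` field name is an assumption, and only `rater.n_samples` comes from this README.

```javascript
// Hypothetical helper: collapse N judge samples for one dimension into a mean.
// Assumes each sample is a finite number on the form's 1-5 scale.
function aggregateJudgeSamples(samples) {
  if (!Array.isArray(samples) || samples.length === 0) {
    throw new Error("need at least one judge sample");
  }
  for (const s of samples) {
    if (!Number.isFinite(s) || s < 1 || s > 5) {
      throw new Error(`sample out of range: ${s}`);
    }
  }
  const mean = samples.reduce((sum, s) => sum + s, 0) / samples.length;
  return {
    score: mean,                          // assumed field name
    rater: { n_samples: samples.length }, // field named in this README
  };
}
```

Calling it with five samples for one dimension, for example `aggregateJudgeSamples([4, 5, 4, 3, 4])`, gives the mean score alongside the sample count to record in the payload.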

### Hybrid review (AI pre-fill → human confirm)

```
open https://eval.qa/demo/aif/form.html?eval_id=<existing-id>&mode=review
```

This loads the existing LLM-judge eval, highlights pre-filled fields in yellow, and lets a human accept, override, or annotate each one. The result is saved as a *new* row with `derived_from` pointing back to the original eval. The dashboard shows the human-vs-LLM delta automatically.
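
A rough sketch of the "new row, linked back" idea, under stated assumptions: `deriveReviewRow` is a hypothetical helper, and only `eval_id`, `rater_id`, and `derived_from` are names taken from this README. The real form may merge fields differently.

```javascript
// Hypothetical sketch: the LLM-judge row is never mutated. The human's
// decisions go into a fresh row that points back at the original.
function deriveReviewRow(llmEval, humanOverrides, raterId, newEvalId) {
  return {
    ...llmEval,        // start from the pre-filled values
    ...humanOverrides, // any field the human changed wins
    eval_id: newEvalId,
    rater_id: raterId,
    derived_from: llmEval.eval_id,
  };
}
```

Keeping the original row untouched is what makes the human-vs-LLM comparison possible later: both versions of the same task stay in the store.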

### See results

```
open https://eval.qa/demo/aif/dashboard.php
```

The dashboard shows total evals, breakdowns by rater type and by modality, a table of recent evals, and calibration deltas for tasks scored by both a human and an LLM judge.
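
For orientation, a calibration delta of the mean-absolute kind can be computed roughly like this. It is a sketch under assumptions, not the code in `dashboard.php`: the row fields `task_id`, `rater_type`, and `score` and the `human` value are assumed names (`llm_judge` matches the `X-Rater-Type` header shown earlier).

```javascript
// Hypothetical sketch: mean absolute delta between human and LLM-judge
// scores, over tasks that have both. Returns null when no task is paired.
function meanAbsoluteDelta(rows) {
  const byTask = new Map();
  for (const row of rows) {
    const entry = byTask.get(row.task_id) ?? {};
    entry[row.rater_type] = row.score; // last row per rater type wins
    byTask.set(row.task_id, entry);
  }
  let total = 0;
  let pairs = 0;
  for (const { human, llm_judge } of byTask.values()) {
    if (typeof human === "number" && typeof llm_judge === "number") {
      total += Math.abs(human - llm_judge);
      pairs += 1;
    }
  }
  return pairs === 0 ? null : total / pairs;
}
```

Tasks scored by only one rater type are skipped rather than counted as zero, so sparse human coverage does not flatter the judge.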

---

## The six market templates

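URL: `form.html?template=<name>`. Each template tunes the headline, default modality, surfaced dimensions, chips, and which expert sections auto-open. **Same schema underneath** - outputs are comparable across markets. The table lists the six market templates plus `universal`, a catch-all that exposes every field as opt-in.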

| Template | URL slug | For |
|---|---|---|
| Foundation model | `?template=foundation` | LLM labs, model providers, internal LLM platform teams |
| AI agent / tool-use | `?template=agent` | Coding agents, browser agents, Anthropic/OpenAI-stack teams |
| RAG / knowledge | `?template=rag` | Enterprise search, customer-support copilots, doc Q&A |
| Robotics / embodied | `?template=robotics` | Manipulation, navigation, surgical robotics, drone autonomy |
| SaaS / in-app AI | `?template=saas` | PMs shipping AI features inside B2B products |
| End-user feedback | `?template=enduser` | Production thumbs + rationale, ten-second feedback |
| Universal | `?template=universal` | Everything - all fields opt-in |

Add a market by editing `TEMPLATES` in `form.html` (one block per market). No schema change required.
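
The query parameters used in this README (`template`, `eval_id`, `mode`) can be assembled with the standard `URL` API. `formUrl` is a hypothetical helper for illustration; whether the form accepts `template` together with `eval_id` is an assumption to verify against `form.html`.

```javascript
const FORM_BASE = "https://eval.qa/demo/aif/form.html";

// Hypothetical helper: build a form link from the parameters this README names.
function formUrl({ template, evalId, mode } = {}) {
  const url = new URL(FORM_BASE);
  if (template) url.searchParams.set("template", template);
  if (evalId) url.searchParams.set("eval_id", evalId);
  if (mode) url.searchParams.set("mode", mode);
  return url.toString();
}
```

Using `searchParams.set` rather than string concatenation means IDs containing `&`, spaces, or other reserved characters are percent-encoded correctly.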

---

## Design principles

1. **Schema-first.** Same JSON Schema for AI and human fillers. AI POSTs JSON; HTML form serializes to identical JSON.
2. **Progressive 3-tier disclosure.** Quick (~10s) → Standard (~60s) → Expert (~3min). Each tier expands only if the previous tier surfaced a concern. Time-to-first-rating < 15 seconds.
3. **Anchor extremes only.** Per Yamauchi et al. 2025, anchoring only 1 and 5 gives the same alignment as anchoring all five points. Less reading, equal signal.
4. **Mandatory rationale on low scores only.** Rationale is required when the score is ≤ 3 or any hazard verdict is violating. Requiring it on every eval trains users to write "n/a."
5. **AILuminate-aligned safety.** 12 hazard categories ship by default. Modality templates extend with physical-world risk for robotics.
6. **Hybrid mode is first-class.** `mode=review` is a documented flow, not an exception.
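
Principle 4 reduces to a small predicate. The sketch below assumes a row with `score`, `rationale`, and a `hazards` map of category to verdict, and assumes the verdict string is `"violating"`; those names are illustrative and may not match `eval-form.schema.json`.

```javascript
// Rationale is required for a score of 3 or lower, or any violating hazard verdict.
function rationaleRequired(score, hazardVerdicts = []) {
  return score <= 3 || hazardVerdicts.includes("violating");
}

// Hypothetical validator: reject rows that need a rationale but lack one.
function checkRationale(row) {
  const verdicts = Object.values(row.hazards ?? {});
  const hasRationale =
    typeof row.rationale === "string" && row.rationale.trim() !== "";
  if (rationaleRequired(row.score, verdicts) && !hasRationale) {
    return { ok: false, error: "rationale required for low score or violating hazard" };
  }
  return { ok: true };
}
```

The same predicate could run client-side to toggle the rationale field and server-side to reject incomplete rows, so the two never disagree.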

See [PLAN.md](./PLAN.md) for the full research synthesis with citations and [VISION.md](./VISION.md) for the broader thesis.

---

## How to extend

**Add a market template:** edit the `TEMPLATES` object in `form.html`. Provide headline, default modality, which Standard dimensions to surface, which chips to use. The schema doesn't change.
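
As a rough picture of what one entry might look like: the key names below (`headline`, `defaultModality`, `dimensions`, `chips`) are guesses derived from the sentence above, and the `legal` market and its values are invented for illustration. Check the existing entries in `form.html` for the real shape before copying this.

```javascript
// Hypothetical shape; the real keys live in form.html and may differ.
const REQUIRED_KEYS = ["headline", "defaultModality", "dimensions", "chips"];

const TEMPLATES = {
  // ...existing market entries would sit here...
  legal: {
    headline: "Rate this contract-review answer", // invented example text
    defaultModality: "rag",                       // one of the schema's modalities
    dimensions: ["accuracy", "groundedness"],     // invented dimension names
    chips: ["missed clause", "wrong citation"],   // invented chip labels
  },
};

// Small guard: list any required key a template entry forgot.
function missingKeys(template) {
  return REQUIRED_KEYS.filter((key) => !(key in template));
}
```

With an entry like this in place, the new market would be reachable at `form.html?template=legal`.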

**Add a hazard or issue tag:** update both `eval-form.schema.json` (enum) and the `HAZARDS` / `FULL_TAGS` constants in `form.html`. Refresh the dashboard to verify the new tag renders.

**Add a modality extension:** the schema already defines `chat`, `agent`, `rag`, `multimodal`. Add a new `$defs` entry and a top-level optional property; add a render block in `renderModExtension()` in `form.html`.

**Swap storage:** `data/evals.jsonl` is deliberately the simplest possible store. To migrate to Postgres / DuckDB, write a tail-and-load job - the JSONL records are already in your target row shape.
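
The parsing half of such a tail-and-load job might look like the sketch below. `readNewRecords` is a hypothetical helper; it takes the file contents as a string plus the offset reached on the previous pass, and leaves any unterminated trailing line for the next pass in case a write is still in progress. Offsets here are string indices, not byte offsets, and the actual database insert is left out.

```javascript
// Hypothetical sketch: parse complete JSONL lines that appear after `offset`.
function readNewRecords(text, offset = 0) {
  const rows = [];
  const skipped = [];
  let pos = offset;
  while (true) {
    const end = text.indexOf("\n", pos);
    if (end === -1) break; // partial last line: wait for the newline
    const line = text.slice(pos, end).trim();
    if (line !== "") {
      try {
        rows.push(JSON.parse(line));
      } catch (err) {
        // Keep going past a corrupt line, but record where it was.
        skipped.push({ offset: pos, message: err.message });
      }
    }
    pos = end + 1;
  }
  return { rows, skipped, nextOffset: pos };
}
```

Persist `nextOffset` between runs and pass it back in; each pass then only touches records appended since the last one.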

---

## Status & limitations

This is a v1 demo, not production. Explicitly **not** included:

- Full JSON Schema validation server-side (hand-written checks only - add `opis/json-schema` for production).
- Auth - the API trusts the network. Wire into the existing `blog/login.php` before public deploy.
- Schema version migrations - JSONL tolerates drift but production needs `schema_version` enforcement and migrators.
- Real IRR statistics - the dashboard shows mean absolute delta as a placeholder. Implement Krippendorff's α for production (it handles more than two raters, missing ratings, and nominal, ordinal, or interval scales).
- Multilingual UI - English only. AILuminate ships Hindi & French prompts; the schema is language-agnostic.
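
As a starting point for the IRR item, here is a sketch of Krippendorff's α restricted to the interval metric (squared difference), which suits 1 - 5 scores treated as evenly spaced. The function name and the input shape (one array of scores per task) are assumptions; nominal and ordinal metrics would need a different distance function, and a vetted statistics library is the safer choice for production.

```javascript
// Krippendorff's alpha, interval metric only.
// `units`: array of arrays, one inner array per task holding every rater's
// numeric score for that task. Missing ratings are simply left out.
function krippendorffAlphaInterval(units) {
  // Only tasks with at least two ratings can contribute pairs.
  const pairable = units.filter((u) => u.length >= 2);
  const pooled = pairable.flat();
  const n = pooled.length;
  if (n < 2) return null;

  // Observed disagreement: squared differences over ordered pairs within
  // each task, weighted by 1 / (ratings in task - 1), averaged over n.
  let observed = 0;
  for (const u of pairable) {
    let sum = 0;
    for (let i = 0; i < u.length; i++) {
      for (let j = 0; j < u.length; j++) {
        if (i !== j) sum += (u[i] - u[j]) ** 2;
      }
    }
    observed += sum / (u.length - 1);
  }
  observed /= n;

  // Expected disagreement: squared differences over all ordered pairs of
  // pooled ratings, ignoring which task they came from.
  let expected = 0;
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      if (i !== j) expected += (pooled[i] - pooled[j]) ** 2;
    }
  }
  expected /= n * (n - 1);

  if (expected === 0) return null; // no variation at all: alpha is undefined
  return 1 - observed / expected;
}
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values indicate systematic disagreement. The double loops are quadratic in the number of ratings, which is fine for a demo-sized store but worth replacing before running over large datasets.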

---

## License

MIT - same as the rest of the EvalQA repo.
