# AIF - AI Foundational-Model Eval Form

Drop-in evaluation form pipeline for the EvalQA platform. **Three pillars, six market templates, one schema, ten-second start.**

The platform stands on three legs:

1. **Who is rating** - every eval carries an Eval Army rater profile (name, L1 - L5 certification, specialties), so calibration drift and inter-rater reliability are measurable. → [`rater.html`](./rater.html)
2. **How to integrate** - a single-tag JS SDK (`embed.js`) drops the form into any SaaS or AI product via iframe + postMessage. → [`integrate.html`](./integrate.html)
3. **The form itself** - six market templates, progressive 3-tier disclosure, schema-driven. → [`form.html`](./form.html)

> Why this exists, in two paragraphs, then everything else:

The 2026 eval landscape is full of LLM-only frameworks (Inspect AI, OpenAI Evals, MLflow Evaluate, DeepEval, Braintrust, LangSmith, Galileo, Confident AI). They all evaluate the same shape of thing - a model output, judged by a model judge or a labeling-tool annotator. Meanwhile AI has escaped the chat box into agents, robots, SaaS features, and human-AI hybrid work, and the eval tooling hasn't followed.

AIF is the front-end of EvalQA's "eval anything" thesis: **one schema, many market-tuned forms, hybrid human+AI grading by default, certified rater workforce on top.** Foundation models today; robotics, SaaS features, and embodied AI next.


---

## What's in this folder

```
demo/aif/
├── README.md                 ← you are here
├── PLAN.md                   ← schema-design research plan (sources cited)
├── VISION.md                 ← "Eval Anything" vision doc
├── eval-form.schema.json     ← the JSON Schema - single source of truth
│
├── form.html                 ← Pillar 3 - the eval form (six templates)
├── rater.html                ← Pillar 1 - Eval Army profile / sign-in
├── integrate.html            ← Pillar 2 - SDK docs + live embed demo
├── embed.js                  ← Pillar 2 - single-file JS SDK for SaaS embedding
│
├── dashboard.php             ← server-rendered results overview
├── api/
│   ├── save_eval.php         ← POST eval, JSONL append, validates
│   ├── get_evals.php         ← GET evals for dashboard / hybrid prefill
│   ├── save_rater.php        ← POST Eval Army profile
│   ├── get_rater.php         ← GET rater profile / roster
│   └── schema.php            ← serves the schema with proper headers
└── data/
    ├── evals.jsonl           ← append-only eval store
    └── raters.jsonl          ← append-only rater profile store
```

## Quick start

### Pillar 1 - sign in as an Eval Army rater

```
open https://eval.qa/demo/aif/rater.html
```

Fill in name + email + certification level (L1 - L5) + specialties. Profile is stored in `data/raters.jsonl` and cached in localStorage; every subsequent eval automatically carries your `rater_id`. Anonymous submissions remain allowed but won't contribute to your IRR track record.

### Pillar 2 - embed in your SaaS / AI product

```html

```

See [`integrate.html`](./integrate.html) for the live demo, full API reference, postMessage protocol, and recipes for: SaaS in-app review, agent CI gate, hybrid review queue, end-user thumbs feedback.

### Pillar 3 - try the form (human rater)

```
open https://eval.qa/demo/aif/form.html
```

You'll land on the template chooser. Pick "Foundation model" (the default) or any of the six markets. Submit your first eval in under 15 seconds.

### Deploy from an LLM judge (AI auto-fill)

POST a JSON payload conforming to `eval-form.schema.json` directly:

```bash
curl -X POST https://eval.qa/demo/aif/api/save_eval.php \
  -H 'Content-Type: application/json' \
  -H 'X-Rater-Type: llm_judge' \
  -H 'X-Rater-Model: claude-sonnet-4-6' \
  -d @your-eval.json
```

Recommended LLM-judge defaults (from [Yamauchi et al. 2025](https://arxiv.org/html/2506.13639v1)):

- Sample N=5 with mean aggregation (`rater.n_samples: 5`).
- Provide both reference answer + score anchors (anchors for 1 and 5 only - intermediate anchors don't add signal).
- CoT reasoning optional when anchors are present.
- Use one judge call per dimension when stakes are high; combined is noisier.

### Hybrid review (AI pre-fill → human confirm)

```
open https://eval.qa/demo/aif/form.html?eval_id=&mode=review
```

Loads the existing LLM-judge eval, highlights pre-filled fields yellow, lets a human accept / override / annotate. Saved as a *new* row with `derived_from` pointing back. The dashboard shows the human-vs-LLM delta automatically.

### See results

```
open https://eval.qa/demo/aif/dashboard.php
```

Total evals, by rater type, by modality, recent table, calibration deltas where both human and LLM-judge scored the same task.

---

## The six market templates

URL: `form.html?template=`. Each template tunes headline, default modality, surfaced dimensions, chips, and which expert sections auto-open. **Same schema underneath** - outputs are comparable across markets.

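As a small illustration of the URL scheme, the sketch below builds template links and hybrid-review links. The helper names `templateUrl` and `reviewUrl` are hypothetical and are not part of `embed.js`; the base URL, the `template`, `eval_id`, and `mode=review` parameters, and the slugs are the ones given in this README.

```javascript
// Base URL from the Quick start; slugs from the template table in this section.
const FORM_URL = "https://eval.qa/demo/aif/form.html";
const TEMPLATE_SLUGS = [
  "foundation", "agent", "rag", "robotics", "saas", "enduser", "universal",
];

// Hypothetical helper: link to the form with a given template pre-selected.
function templateUrl(slug) {
  if (!TEMPLATE_SLUGS.includes(slug)) {
    throw new Error(`Unknown template slug: ${slug}`);
  }
  const url = new URL(FORM_URL);
  url.searchParams.set("template", slug);
  return url.toString();
}

// Hypothetical helper: link to hybrid review of an existing eval.
function reviewUrl(evalId) {
  if (!evalId) {
    throw new Error("evalId is required");
  }
  const url = new URL(FORM_URL);
  url.searchParams.set("eval_id", evalId);
  url.searchParams.set("mode", "review");
  return url.toString();
}

console.log(templateUrl("robotics"));
console.log(reviewUrl("example-eval-id")); // placeholder id, not a real eval
```

Whether `template` and `eval_id` can be combined in a single URL is not stated here, so the sketch keeps the two links separate.
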
| Template | URL slug | For |
|---|---|---|
| Foundation model | `?template=foundation` | LLM labs, model providers, internal LLM platform teams |
| AI agent / tool-use | `?template=agent` | Coding agents, browser agents, Anthropic/OpenAI-stack teams |
| RAG / knowledge | `?template=rag` | Enterprise search, customer-support copilots, doc Q&A |
| Robotics / embodied | `?template=robotics` | Manipulation, navigation, surgical robotics, drone autonomy |
| SaaS / in-app AI | `?template=saas` | PMs shipping AI features inside B2B products |
| End-user feedback | `?template=enduser` | Production thumbs + rationale, ten-second feedback |
| Universal | `?template=universal` | Everything - all fields opt-in |

The first six rows are the market templates; `universal` is a catch-all that exposes every field as opt-in.

Add a market by editing `TEMPLATES` in `form.html` (one block per market). No schema change required.

---

## Design principles

1. **Schema-first.** Same JSON Schema for AI and human fillers. AI POSTs JSON; HTML form serializes to identical JSON.
2. **Progressive 3-tier disclosure.** Quick (~10s) → Standard (~60s) → Expert (~3min). Each tier expands only if the previous tier surfaced a concern. Time-to-first-rating < 15 seconds.
3. **Anchor extremes only.** Per Yamauchi 2025 - anchoring 1 and 5 only gives the same alignment as anchoring all five. Less reading, equal signal.
4. **Mandatory rationale on low scores only.** Rationale is required when the score is ≤ 3 or when any hazard verdict is violating. Constant requirements train users to write "n/a."
5. **AILuminate-aligned safety.** 12 hazard categories ship by default. Modality templates extend with physical-world risk for robotics.
6. **Hybrid mode is first-class.** `mode=review` is a documented flow, not an exception.

See [PLAN.md](./PLAN.md) for the full research synthesis with citations and [VISION.md](./VISION.md) for the broader thesis.

---

## How to extend

**Add a market template:** edit the `TEMPLATES` object in `form.html`. Provide headline, default modality, which Standard dimensions to surface, which chips to use.
The schema doesn't change.

**Add a hazard or issue tag:** update both `eval-form.schema.json` (enum) and the `HAZARDS` / `FULL_TAGS` constants in `form.html`. Refresh the dashboard to verify it renders.

**Add a modality extension:** the schema already defines `chat`, `agent`, `rag`, `multimodal`. Add a new `$defs` entry and a top-level optional property, then add a render block in `renderModExtension()` in `form.html`.

**Swap storage:** `data/evals.jsonl` is deliberately the simplest possible store. To migrate to Postgres / DuckDB, write a tail-and-load job - the JSONL records are already in your target row shape.

---

## Status & limitations

This is a v1 demo, not production. Explicitly **not** included:

- Full JSON Schema validation server-side (hand-written checks only - add `opis/json-schema` for production).
- Auth - the API trusts the network. Wire into the existing `blog/login.php` before public deploy.
- Schema version migrations - JSONL tolerates drift but production needs `schema_version` enforcement and migrators.
- Real IRR statistics - the dashboard shows mean absolute delta as a placeholder. Implement Krippendorff's α for production (handles >2 raters and mixed scales).
- Multilingual UI - English only. AILuminate ships Hindi & French prompts; the schema is language-agnostic.

---

## License

MIT - same as the rest of the EvalQA repo.