Level 5 • Advanced
The Eval Engineering Track: Building AI Quality Infrastructure



Table of Contents
  1. Eval Infrastructure Stack
  2. Core Technical Skills
  3. System Design Interview
  4. Building Eval Tooling
  5. Career Ladder & Promotions
  6. Compensation & Equity

Eval Infrastructure Stack: Storage, Compute, APIs, Observability

The eval engineering track differs fundamentally from general ML engineering. While traditional ML engineers optimize models and training pipelines, eval engineers build the systems that measure quality. This requires deep expertise in data pipelines, distributed systems, statistical computing, and observability—plus unique domain knowledge about evaluation itself. You're not building the AI; you're building the quality assurance infrastructure for AI.

The core eval infrastructure stack consists of four pillars: storage (how eval results are persisted), compute (how eval jobs are scheduled and executed), APIs (how systems query and integrate eval data), and observability (how you monitor and debug eval systems). Each pillar has distinct design challenges.

Eval Results Database Design with Full Schema

Most organizations start with a flat file system (JSON lines, CSVs) for eval results. This works for 5-10 evals. At 100+ evals with millions of examples, it becomes impossible to manage. A proper eval database must support multiple access patterns and maintain data integrity at scale.

Your database must handle: point lookups of run-level summaries ("what was the BLEU score for this run?"), time-series queries across runs, dimensional breakdowns (language, domain, model version), example-level drill-down for debugging, and reliable lineage from every result back to its run, dataset version, and config.

A reference schema structure for PostgreSQL:

CREATE TABLE evals (
  id UUID PRIMARY KEY,
  name VARCHAR NOT NULL UNIQUE,
  description TEXT,
  eval_type VARCHAR,
  created_at TIMESTAMP,
  updated_at TIMESTAMP,
  owner_team VARCHAR,
  config JSONB,
  status VARCHAR
);

CREATE TABLE eval_runs (
  id UUID PRIMARY KEY,
  eval_id UUID NOT NULL REFERENCES evals(id),
  deployment_id UUID,
  model_version VARCHAR,
  model_config JSONB,
  dataset_name VARCHAR,
  dataset_version VARCHAR,
  dataset_size INT,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  duration_seconds INT,
  status VARCHAR,
  error_message TEXT,
  config JSONB,
  metadata JSONB
);

CREATE INDEX ON eval_runs(eval_id, started_at DESC);
CREATE INDEX ON eval_runs(model_version, started_at DESC);

CREATE TABLE eval_results (
  id UUID PRIMARY KEY,
  run_id UUID NOT NULL REFERENCES eval_runs(id),
  example_id VARCHAR NOT NULL,
  metric_name VARCHAR NOT NULL,
  metric_value FLOAT,
  metric_category VARCHAR,
  metadata JSONB,
  created_at TIMESTAMP
);

CREATE INDEX ON eval_results(run_id, metric_name);
CREATE INDEX ON eval_results(example_id);
CREATE INDEX ON eval_results(metric_name, metric_value);

This schema allows complex queries: "Show me all instances where BLEU score is below 0.7 for the past 30 days," or "Compare token-accuracy across model versions for language=Spanish," or "Find the 99th percentile of response times for the support agent eval." The indexes make these queries fast.
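As a sketch, here is the first of those queries run against a SQLite stand-in for the `eval_results` table above (table contents and the cutoff date are made up; PostgreSQL's date syntax differs slightly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE eval_results (
        run_id TEXT, example_id TEXT, metric_name TEXT,
        metric_value REAL, created_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO eval_results VALUES (?, ?, ?, ?, ?)",
    [
        ("r1", "ex1", "bleu", 0.65, "2025-02-10"),
        ("r1", "ex2", "bleu", 0.82, "2025-02-10"),
        ("r1", "ex3", "bleu", 0.58, "2025-02-11"),
    ],
)

# "Show me all instances where BLEU score is below 0.7 since a cutoff date"
low_bleu = conn.execute(
    """
    SELECT example_id, metric_value
    FROM eval_results
    WHERE metric_name = 'bleu'
      AND metric_value < 0.7
      AND created_at >= ?
    ORDER BY metric_value
    """,
    ("2025-02-01",),
).fetchall()
# low_bleu == [('ex3', 0.58), ('ex1', 0.65)]
```

The composite index on (metric_name, metric_value) is what keeps this fast at millions of rows.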

For analytical queries and dashboards, use a columnar database or data warehouse (DuckDB, Apache Iceberg, BigQuery, Snowflake) for efficient aggregations and joins. Sync data from PostgreSQL to your warehouse nightly, or stream changes in real time with a CDC tool such as Debezium (typically delivering change events through Kafka).

Compute: Eval Job Scheduling Architecture

Eval jobs can take seconds (quick metrics like BLEU) to hours (human annotation studies) to days (large-scale manual evaluation campaigns). You need a job scheduler that handles backpressure, retries, resource allocation, and task dependencies. Most companies use one of: Apache Airflow, Prefect, Argo Workflows, Temporal, or Kubernetes CronJobs with custom controllers.

Key design considerations when choosing or building your eval scheduler: wildly heterogeneous task durations (milliseconds to days), dependencies between metrics, retry semantics for transient failures, backpressure when queues grow faster than workers drain them, prioritization (a release-blocking eval should jump ahead of a nightly backfill), and cost controls for expensive work like human evaluation.

A production eval scheduling system typically includes: (1) a task queue (Redis, RabbitMQ, or Kafka), (2) workers (containers or serverless functions that execute evals), (3) a state store (what's running, what's done, what failed—usually a database), and (4) a control plane (decides what to schedule next based on dependencies, resources, priority).

Example architecture: Use Airflow DAGs to define eval workflows. Each eval run triggers a DAG. Tasks fan out to compute metrics in parallel. Failed tasks retry with exponential backoff. Results are written to PostgreSQL. Large artifacts go to S3. Monitor task duration and alert if a task takes 10x longer than expected.
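Airflow provides retry-with-exponential-backoff natively (via `retries` and `retry_exponential_backoff` on a task); as a sketch of what that behavior looks like outside Airflow, here is a minimal standalone version with an injectable `sleep` so it can be tested without waiting:

```python
import time

def run_with_retries(task, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure, wait base_delay * 2**attempt and try again."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the scheduler
            sleep(base_delay * (2 ** attempt))
```

Delays grow 1s, 2s, 4s, 8s, ..., which gives transient failures (rate limits, flaky networks) time to clear without hammering the upstream service.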

Eval Results API Design: Multi-Pattern Access

Once you have results stored, every downstream system needs access. Your API must support multiple distinct access patterns, and optimizing for one pattern means pessimizing for others. You need to understand your users and design accordingly.

Common access patterns: run-level summaries for CI gates and release decisions, time-series for dashboards, dimensional breakdowns for error analysis, example-level drill-down ("show me the hardest examples"), and pairwise run comparisons with significance tests.

A RESTful API design:

GET /api/evals/{eval_id}/runs/{run_id}/summary
  → { 
      metric_name: "bleu", 
      value: 0.742, 
      confidence_interval: [0.738, 0.746],
      sample_size: 1000
    }

GET /api/evals/{eval_id}/timeseries?metric=bleu&days=30
  → [
      { date: "2025-02-01", value: 0.725, n: 1000 },
      { date: "2025-02-02", value: 0.728, n: 980 },
      ...
    ]

GET /api/evals/{eval_id}/runs/{run_id}/breakdown?metric=bleu&dimension=language
  → { 
      en: { value: 0.81, n: 500 },
      es: { value: 0.75, n: 300 },
      fr: { value: 0.72, n: 200 }
    }

GET /api/evals/{eval_id}/runs/{run_id}/examples?metric=bleu&percentile=hardest&limit=10
  → [{ 
      example_id: "123", 
      input: "...", 
      prediction: "...", 
      reference: "...",
      metric_value: 0.42,
      metrics: { bleu: 0.42, rouge: 0.38 }
    }, ...]

POST /api/evals/compare
  → {
      runs: [run_1_id, run_2_id],
      metrics: ["bleu", "rouge"],
      results: [
        { metric: "bleu", run_1: 0.74, run_2: 0.79, p_value: 0.002 }
      ]
    }

Key API design principles: (1) Read-optimized: use materialized views or caches for complex queries, (2) Batch support: fetching 1000 metrics should not require 1000 HTTP calls, (3) Versioning: support /v1/, /v2/ endpoints to evolve the API without breaking clients, (4) Filtering: always support filtering by time range, model version, dataset, and other dimensions, (5) Pagination: large result sets should be paginated, (6) Caching: aggressive caching for time-series and summary queries, (7) Rate limiting: to prevent abuse and ensure fair access.
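Principle (6), aggressive caching with different TTLs for summaries versus historical data, can be sketched with a small in-process TTL cache (a hypothetical stand-in; in production you would more likely use Redis or an HTTP caching layer):

```python
import time

class TTLCache:
    """Tiny time-to-live cache; clock is injectable for testing."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # lazily evict expired entries
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)
```

A summary endpoint would use a 1-hour TTL; a historical time-series endpoint could safely use a 1-day TTL since old data never changes.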

Observability: Logging, Tracing, and Alerting for Eval Systems

When an eval produces unexpected results, you need to understand why. This requires three layers of observability: structured logging, distributed tracing, and automated alerting.

Structured Logging: Don't just log "eval finished." Log: "processed 1000 examples in 45s, 2 examples timed out, 15 examples hit API rate limits, largest example had 500 tokens." Include: checkpoint times (data loading took 5s, metrics took 30s), example-level logs (example 123 failed with error X), error traces with context, resource usage (memory peaked at 4GB), network activity (made 1500 API calls).

# structlog-style key=value logging; with the stdlib logger, pass fields via extra=
logger.info(
  "eval_finished",
  eval_id="abc123",
  run_id="xyz789",
  examples_processed=1000,
  duration_seconds=45,
  memory_peak_mb=4096,
  api_calls=1500,
  timeouts=2,
  rate_limit_hits=15,
  metrics={"bleu": 0.742, "rouge": 0.691},
)

Distributed Tracing: Trace the flow of one example through the eval pipeline. If an example fails, you want to see exactly where: data loading, model inference, metric computation, result serialization? Include latency at each step. Use OpenTelemetry or Jaeger for this.

# One trace for one example
trace_id: "abc-123-xyz"
  span: data_load (duration: 10ms)
  span: model_inference (duration: 350ms, model=gpt4)
  span: metric_computation (duration: 45ms, metrics=[bleu, rouge])
  span: result_write (duration: 5ms)

Automated Alerting: Define alerts for known failure modes: When eval results regress unexpectedly (e.g., accuracy dropped 20%), when an annotation study stalls (no progress for 2 hours), when eval duration suddenly spikes (10x normal), when rater agreement drops below threshold, when data quality metrics (missing fields, malformed JSON) exceed a threshold. For each alert, define: who gets notified (Slack, PagerDuty), the runbook for fixing it, and the resolution time SLA.
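A minimal sketch of the severity routing described above, assuming the 20% and 50% drop thresholds mentioned here (names and thresholds are illustrative, not a standard):

```python
def alert_severity(current, baseline, warn_drop=0.20, critical_drop=0.50):
    """Map a metric regression to an alert severity, or None if within tolerance."""
    if baseline <= 0:
        return None  # no meaningful baseline to compare against
    drop = (baseline - current) / baseline
    if drop >= critical_drop:
        return "critical"   # e.g. page the on-call engineer via PagerDuty
    if drop >= warn_drop:
        return "warning"    # e.g. post to the team's Slack channel
    return None
```

In practice each severity maps to a notification channel, a runbook link, and a resolution SLA, as described above.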

Most teams use: Datadog, New Relic, or Prometheus for metrics; ELK stack (Elasticsearch, Logstash, Kibana) or Grafana Loki for logs; Jaeger, Zipkin, or Datadog APM for traces. The critical requirement is that eval engineers can quickly diagnose problems without having to instrument code manually.

Core Technical Skills for Eval Engineers

What separates a junior eval engineer from a staff-level one? Deep technical skills. You need to be able to implement complex evals from scratch, debug performance bottlenecks, and architect large-scale evaluation systems.

Python for Eval Pipelines

You need deep Python expertise. Not just syntax, but: async/await for concurrent eval execution, type hints for data validation and IDE support, testing (pytest, hypothesis for property testing), profiling (cProfile, memory_profiler) for optimization. Most eval code is Python: data loading, metric computation, annotation parsing, result aggregation.

Skills you should master: async/await for issuing model API calls concurrently, generators and iterators for streaming large datasets, type hints with dataclasses or Pydantic for validating eval records, pytest and hypothesis for testing metric implementations, and cProfile/memory_profiler for finding bottlenecks.

Common libraries and when to use them: pandas or Polars for result aggregation and analysis, NumPy for vectorized metric computation, Pydantic for schema validation at pipeline boundaries, aiohttp or httpx for concurrent API calls, and tqdm for progress reporting on long-running jobs.

Statistical Testing Libraries and Deep Statistical Knowledge

You must understand statistical significance deeply. Not just "p < 0.05," but effect sizes, multiple comparisons correction, power analysis, assumptions of tests, and when hypothesis tests are appropriate.

Core libraries: SciPy (scipy.stats for t-tests, Mann-Whitney, chi-squared, Kolmogorov-Smirnov), statsmodels (linear regression, ANOVA, Bayesian models), pingouin (comprehensive stats, effect sizes), scikit-posthocs (post-hoc tests for pairwise comparisons).
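When test assumptions are questionable (metric distributions are rarely normal), a bootstrap confidence interval on the difference in means is a robust fallback. A pure-stdlib sketch; scipy and statsmodels give you this and much more:

```python
import random
import statistics

def bootstrap_diff_ci(baseline, candidate, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap CI for mean(candidate) - mean(baseline).

    If the interval excludes 0, the difference is significant at level alpha.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = [rng.choice(baseline) for _ in baseline]
        resampled_b = [rng.choice(candidate) for _ in candidate]
        diffs.append(statistics.mean(resampled_b) - statistics.mean(resampled_a))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

The interval width also answers the power question directly: if the CI is too wide to distinguish the effect you care about, you need more examples.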

Questions you should be able to answer: When is a t-test appropriate versus a nonparametric alternative like Mann-Whitney? If you compare 50 metrics across two model versions, how do you correct for multiple comparisons (Bonferroni, Benjamini-Hochberg)? How many examples do you need to detect a 2-point accuracy difference with 80% power? And is a statistically significant difference practically significant (what is the effect size)?

Data Pipeline Tools: Airflow, Prefect, Argo

You need to orchestrate complex multi-step evals. Tools: Apache Airflow (industry standard, mature, complex), Prefect (more Pythonic, dynamic DAGs, growing adoption), Dagster (data-aware orchestration, great error handling), or Temporal (for long-running workflows with state machines).

Expertise here means: designing DAGs for your eval workflows, handling failures and retries intelligently (exponential backoff, max retries), managing backpressure (don't submit 10K tasks at once, use task pools), monitoring pipeline health (tasks stuck, tasks failing more than expected), debugging data quality issues (why are results missing?), and cost optimization (run expensive tasks in parallel, batch small tasks together).

ML Experiment Tracking: MLflow, Weights & Biases, Neptune

Track which dataset version, which model version, which eval config produced these results. When you change anything, you want to see exactly what changed and what impact it had.

For eval-specific use cases: tracking annotation studies (which raters participated, what was their agreement, inter-rater reliability), tracking eval iterations (how we refined the eval over time, how metrics changed as we fixed bugs), comparing baseline vs. improved versions of metrics, and correlating eval metrics with downstream business metrics.

The Broader Stack

You also need familiarity with: Docker (containerizing eval jobs for reproducibility and scaling), Kubernetes (if running eval infrastructure at scale), SQL (querying your eval database for analysis and debugging), Git (version control for code and configs, understanding CI/CD), and cloud infrastructure (AWS, GCP, or Azure SDKs for data access and compute resources).

Additionally: understanding of distributed systems (what happens when tasks fail? how do you coordinate?), networking (why are API calls slow?), and basic DevOps (monitoring, logging, alerting).

Eval System Design Interview: 5 Common Questions with Detailed Solutions

Eval engineering roles often include system design rounds. Here are 5 common questions and detailed solution sketches. The goal is to show clear thinking, understand tradeoffs, and ask good clarifying questions.

Question 1: Design an Eval Results Storage and Retrieval System

Problem: You need to store eval results from 100+ different evals, each producing different metrics, and support fast queries like "what was the BLEU score for model X on eval Y?" and "show me all examples where the model failed" and "break down accuracy by language and domain."

Clarifying questions: How many results per day? (Order of magnitude.) What's the latency requirement? Do you need real-time or is hourly OK? What's the scale of queries? Thousands per day or millions?

Solution sketch:

OLTP layer (operational database): PostgreSQL for storing results. Schema as described earlier: evals, eval_runs, eval_results tables with appropriate indexes. Use JSONB columns for flexible metadata that doesn't fit the relational schema. Partition results by time (monthly or yearly) to keep tables performant.

Caching layer: Redis for frequently accessed summaries. Cache: (eval_id, run_id) → summary stats, (eval_id) → recent runs, popular queries by language/domain.

OLAP layer (analytical database): For complex queries (breakdowns, correlations, comparisons), use a columnar database like DuckDB or a cloud data warehouse like BigQuery. Sync PostgreSQL → DuckDB nightly via SQL dumps. This gives you the best of both worlds: fast operational queries and fast analytical queries.

API layer: REST API abstracting storage details. Query returns: metric value, confidence interval, sample size, last updated timestamp. Cache API responses aggressively (1 hour for summaries, 1 day for historical).

Artifact storage: S3 or GCS for large artifacts (annotation files, detailed error analyses, visualizations). Database stores reference to S3 keys.

Tradeoffs: PostgreSQL is simpler to set up but slower for analytical queries. DuckDB requires nightly syncs (not real-time) but much faster analytics. Redis caching adds complexity but improves response time. The multi-layer approach balances simplicity, cost, and performance.

Question 2: Design an Evaluation Job Scheduler at Scale

Problem: You have 1000 examples to evaluate, 50 different metrics to compute on each, and metrics have dependencies (some depend on others). Some metrics are expensive (human evaluation costs $10/example, takes hours). Others are cheap (BLEU costs 1 cent, takes milliseconds). You need to schedule this efficiently, handle failures, manage costs, and provide progress visibility.

Clarifying questions: How often do evals run? (Daily, on-demand, when a new model is deployed?) What's acceptable latency? (Hours? Days?) Budget constraints?

Solution sketch:

Orchestration: Use Airflow DAGs. One DAG per eval type. Tasks fan out to compute metrics in parallel.

load_data → [compute_metric_1, compute_metric_2, ..., compute_metric_50 in parallel] → aggregate → store_results

Task pools for cost control: Create a task pool called "human_eval_expensive" with max 5 concurrent tasks. Assign human eval tasks to this pool. This ensures you're spending money in a controlled way (at most 5 evaluations in parallel = $50/hour cost cap).
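The task-pool idea generalizes: any primitive that caps in-flight work gives you a cost cap. A sketch with `asyncio.Semaphore` (the function name and the no-op sleep standing in for the expensive call are illustrative):

```python
import asyncio

async def run_human_evals(examples, max_concurrent=5):
    """Run expensive evaluations with a hard cap on concurrency (a cost cap)."""
    pool = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0   # high-water mark, so we can verify the cap actually holds

    async def evaluate(example):
        nonlocal in_flight, peak
        async with pool:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for the slow, costly eval call
            in_flight -= 1
            return example

    results = await asyncio.gather(*(evaluate(e) for e in examples))
    return results, peak
```

With at most 5 evaluations in flight at $10/example-hour each, spend is bounded at $50/hour regardless of how many examples are queued.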

Caching for efficiency: Before running a metric, check if it was computed before with the same inputs (example, metric, model). If yes and nothing has changed, reuse the result. On repeated evals, where unchanged examples dominate, this can eliminate the large majority of compute.
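A content-addressed cache key makes the "same inputs" check precise: hash the example, metric name, model version, and metric config together (a sketch; a real system would also fold in the metric implementation's version):

```python
import hashlib
import json

def metric_cache_key(example, metric_name, model_version, metric_config=None):
    """Deterministic key: identical inputs always hash to the same key."""
    payload = json.dumps(
        {
            "example": example,
            "metric": metric_name,
            "model": model_version,
            "config": metric_config or {},
        },
        sort_keys=True,  # make the serialization order-independent
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any change to the inputs produces a different key, so stale results can never be reused by accident.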

Incremental re-runs: If a run partially fails (100 of 1000 examples failed), support restarting from the failure point without re-computing successful examples. Make this transparent to the user.

Error handling: Use task retries with exponential backoff for transient failures. For permanent failures, move to a dead-letter queue. Alert the user about failures but don't block the entire run.

Progress visibility: Provide a status dashboard: X% of tasks done, estimated time remaining, which tasks failed, which tasks are retrying. Emit events to Slack/email when major milestones complete.

Resource management: Track resource usage (CPU, memory, API calls) per task. Alert if a task is using abnormally high resources (might indicate a bug or infinite loop).

Question 3: Design a Multi-Dimensional Eval Comparison System

Problem: You want to compare eval results across multiple dimensions: model version, dataset, language, domain, etc. Users should be able to slice and dice results any way they want: "show me accuracy for model A vs. model B, language=EN, domain=legal." Different dimensions have different cardinalities (maybe 2 model versions but 50+ languages).

Solution sketch:

Schema design: Store all dimension information with each result. Schema: (eval_id, run_id, example_id, metric_name, metric_value, model_version, dataset_name, language, domain, ...).

Pre-computed aggregates: Pre-compute common aggregations and store in a fact table: (eval_id, model_version, language, domain, metric_name) → (mean, stddev, min, max, count). This makes slice queries instant. Update fact tables nightly.
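The nightly fact-table build is a group-by over the raw results. A stdlib sketch of the aggregation (in practice this would be a SQL GROUP BY in your warehouse):

```python
from collections import defaultdict
import statistics

def build_fact_rows(results):
    """Aggregate per-example results into fact-table rows keyed by dimensions."""
    groups = defaultdict(list)
    for r in results:
        key = (r["eval_id"], r["model_version"], r["language"],
               r["domain"], r["metric_name"])
        groups[key].append(r["metric_value"])
    return [
        {
            "key": key,
            "mean": statistics.mean(values),
            "stddev": statistics.pstdev(values),
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
        for key, values in groups.items()
    ]
```

Slice queries then read one pre-aggregated row per dimension combination instead of scanning millions of raw results.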

OLAP database: Use Pinot, Druid, or DuckDB for multi-dimensional queries. These databases are optimized for slicing and dicing.

Hierarchical dimensions: Some dimensions have hierarchies (language → region, e.g., en_US → North America). Support roll-up queries: "accuracy by region" automatically aggregates across languages in that region.

API design: Expose simple filters. Users specify dimensions they care about, what aggregation (mean, median, percentile), and the API returns results.

GET /api/compare?model_versions=["v1","v2"]&languages=["en"]&domain=legal&metric=accuracy

Materialized views: For very common queries, create materialized views. Update them incrementally when new results arrive.

Question 4: Design a Quality Monitoring System for Evals

Problem: You want to catch when eval results are unexpected or suspicious. What metric values are outliers? When does eval quality regress (results change dramatically)? How do you detect rater drift (human raters becoming less reliable)? You have hundreds of metrics across dozens of evals.

Solution sketch:

Distributional monitoring: For each metric, maintain a historical distribution. When a new run completes, compare its distribution to the historical one. Use Kolmogorov-Smirnov test or Jensen-Shannon divergence to detect distributional shift. If p < 0.01, alert.
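The KS statistic itself is just the largest gap between the two empirical CDFs. A pure-Python sketch (`scipy.stats.ks_2samp` gives you the statistic plus the p-value):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        f_a = bisect.bisect_right(a, v) / len(a)   # empirical CDF of a at v
        f_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(f_a - f_b))
    return d
```

The statistic ranges from 0 (identical distributions) to 1 (fully disjoint), which makes it a convenient drift score to chart alongside the p-value.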

Outlier detection: For each metric, fit a normal distribution to historical values. For new results, flag examples where the metric is 3+ standard deviations from the mean. These are potential data quality issues or genuine hard cases.

Trend monitoring: Fit a time-series model to historical metric values. If today's value deviates significantly from the trend, alert.

Rater quality monitoring: If using human raters, track: inter-rater agreement (Fleiss' kappa, Krippendorff's alpha), rater accuracy (if you have gold labels), task completion time. Alert if agreement suddenly drops or completion time increases (sign of confusion or disengagement).
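For two raters, Cohen's kappa is simple enough to sketch directly (Fleiss' kappa and Krippendorff's alpha generalize to more raters and missing data, and are available in libraries rather than worth hand-rolling):

```python
def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance both pick the same label independently
    labels = set(rater_a) | set(rater_b)
    p_e = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if p_e == 1.0:  # degenerate case: both raters used one identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

Chart kappa per rater pair over time; a sudden drop is your drift signal, even when raw accuracy looks stable.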

Alert routing: Route alerts based on severity and type. Critical alert (metric dropped 50%) → page on-call engineer. Warning (metric drifted 20%) → Slack notification. Info (new high-performing model variant) → email. Each alert includes: what changed, how much, what might have caused it, and what to check.

Runbooks: Each alert type has a runbook: "metric dropped 50%, check: (1) did data change? (2) did model change? (3) did annotation criteria change? (4) is there a bug?" Runbooks should be discoverable and executable.

Question 5: Design an Eval-as-a-Service Platform for Multiple Teams

Problem: Multiple product teams want to use a shared eval platform. Each team has different needs (different metrics, different evaluation criteria, different SLAs). How do you build a self-service platform that's flexible but maintains data quality? Teams should be able to submit evals without writing code.

Solution sketch:

Config-driven evals: Provide a declarative format (YAML or JSON) for defining evals. Example:

name: sentiment-analysis-eval-v2
eval_type: classification
dataset: twitter_sentiment_v1
model_id: model_abc123
metrics:
  - name: accuracy
  - name: precision
    per_class: true
  - name: f1
annotations:
  required: true
  sample_size: 200
  task: "Classify sentiment as positive, negative, or neutral"

UI for non-technical users: Build a web UI where teams can define evals without touching code. Form fields for: eval name, dataset selection, metric selection, annotation requirements, etc. Under the hood, this generates the YAML config.

Config validation: When a team submits an eval config, validate: do all referenced datasets exist and is the team authorized to use them? Are all metrics computable? Are annotation requirements realistic (do you have enough raters)? Provide helpful error messages.
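A sketch of that validation step, assuming the YAML has been parsed into a dict (field names follow the example config above; the dataset catalog is a hypothetical stand-in for a real catalog service):

```python
REQUIRED_FIELDS = {"name", "eval_type", "dataset", "metrics"}
KNOWN_DATASETS = {"twitter_sentiment_v1"}   # stand-in for a catalog lookup

def validate_eval_config(config):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for field in sorted(REQUIRED_FIELDS - config.keys()):
        errors.append(f"missing required field: {field}")
    if "dataset" in config and config["dataset"] not in KNOWN_DATASETS:
        errors.append(f"unknown dataset: {config['dataset']}")
    if not config.get("metrics"):
        errors.append("at least one metric is required")
    return errors
```

Returning all errors at once, rather than failing on the first, is what makes the error messages genuinely helpful to submitting teams.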

Templates: Provide templates for common eval types: classification accuracy, LLM-as-judge, regression metrics, annotation study. Teams can use a template or customize.

Execution pipeline: When config is submitted: (1) validate, (2) create an Airflow DAG, (3) submit to the scheduler, (4) monitor progress, (5) email results when done. Teams get a nice report showing results, confidence intervals, breakdowns by dimension.

Access control: Teams can see their own evals. Admins see everything. Use RBAC: teams have reader/writer/admin roles. Only admins can modify eval configs after creation (audit trail).

Cost tracking: Track compute cost per eval. Allocate costs to teams (chargeback model). This incentivizes efficient evals and makes budgets visible.

Notifications: Teams get notified when evals complete, fail, or need action. Use Slack, email, or in-app notifications based on preference.

Building Eval Tooling for Your Team: Libraries, CLIs, Dashboards, and Open Source

Every company ends up building eval infrastructure. Instead of each team reimplementing, centralize in an internal library.

Internal Eval Libraries

Your library should provide: versioned dataset loading, a registry of reusable metric implementations, an eval runner that handles batching, concurrency, and retries, result persistence to your eval database, and statistical helpers (summary stats, confidence intervals, significance tests).

Example API:

import eval_sdk

# Load data
dataset = eval_sdk.load_dataset("squad_v2", split="validation")

# Define metrics
metrics = [
    eval_sdk.Metric("exact_match"),
    eval_sdk.Metric("f1"),
    eval_sdk.Metric("bleu"),
]

# Run eval
results = eval_sdk.run_eval(
    name="gpt4-squad-v2",
    model=my_model,
    dataset=dataset,
    metrics=metrics,
)

# Store results
eval_sdk.save_results(results, database=db)

# Analyze
stats = eval_sdk.compute_stats(results)

CLI Tools for Common Tasks

Build command-line tools for common eval tasks: kicking off an eval run, checking run status, comparing two runs, exporting results for analysis, and validating eval configs before submission.

Make these tools discoverable (eval --help, eval run --help) and well-documented. Include examples in help text.

Dashboards for Different Audiences

Build dashboards for different stakeholders: executives want top-line quality trends, PMs want breakdowns by product area and launch readiness, engineers want example-level drill-down for debugging, and annotation leads want rater throughput and agreement.

Use a BI tool (Looker, Tableau, Grafana) or build custom dashboards with React + D3/Recharts. Key principle: make it easy to ask and answer questions about quality without asking engineers for custom queries.

Open Source Contributions

As you build eval infrastructure, contribute back to the community. This establishes expertise and helps the field. Areas ripe for contribution: reference metric implementations, statistical testing utilities for run comparisons, annotation and review tooling, and integrations between eval harnesses and orchestrators.

Open source contributions improve your resume and establish technical credibility. Potential employers look for engineers with strong open source track records.

Eval Engineer Career Ladder: IC1 to IC5 and Promotions

Most tech companies use individual contributor (IC) levels. Here's what eval engineering looks like at each level and what promotions look like.

IC1 / Entry-Level Eval Engineer (0-1.5 years)

Scope: One eval at a time. Work is supervised. You're still learning the infrastructure.
Expectations: Implement metrics following established patterns. Load datasets. Write basic eval scripts. Follow coding standards. Write tests for your code. Learn the infrastructure. Come to design reviews prepared. Ask thoughtful questions.
Impact: You execute evals reliably. Others can use your code without issues.
Time to IC2: 12-18 months.
Differentiator: Shows initiative. Understands not just what to do but why. Takes ownership. Learns fast.

IC2 / Mid-Level Eval Engineer (1.5-3 years)

Scope: Multiple evals autonomously. Start influencing technical direction.
Expectations: Design evals end-to-end. Identify what metrics we're missing and propose new ones. Mentor IC1s on best practices. Contribute to infrastructure improvements. Propose and implement tooling enhancements. Communicate results clearly to stakeholders (engineers, PMs, executives). Participate in design reviews as a peer, not just a listener.
Impact: You unblock teams by evaluating their models. Infrastructure improvements you make benefit many teams. Your insights improve how we approach evaluation.
Time to IC3: 18-24 months.
Differentiator: Thinks deeply about what makes a good eval. Proposes novel evaluation approaches. Colleagues respect your technical judgment.

IC3 / Senior Eval Engineer (3-5 years)

Scope: Multiple evaluation domains. Drive architectural decisions. Mentor IC1 and IC2s.
Expectations: Design the eval strategy for a major product area (e.g., all LLM outputs, or all recommendation systems). Build and maintain critical eval infrastructure used by many teams. Conduct research on new evaluation techniques. Lead design reviews. Drive cross-team eval standards. Publish findings (blog posts, internal tech talks, conference talks). Represent the company in eval communities. Make decisions about tool choices, infrastructure design, best practices.
Impact: You move the company's eval maturity forward. Your decisions affect how hundreds of engineers evaluate AI systems. You identify quality issues before they become customer issues.
Time to IC4: 24-36 months (if pursuing staff track).
Differentiator: Deep domain expertise. Recognizes subtle eval problems others miss. Improves quality across the organization. Mentors become strong IC3s.

IC4 / Staff Eval Engineer (5-8 years)

Scope: Eval strategy across multiple products. Organization-wide influence.
Expectations: Set eval standards for the company. Design large-scale evaluation programs (e.g., continuous evaluation infrastructure serving 100+ teams). Identify emerging eval challenges (new model types, new domains, regulatory requirements) and propose solutions. Mentor IC3s who are growing toward IC4. Work with executives on quality strategy and trade-offs (speed vs. quality, cost vs. comprehensiveness). Lead major eval infrastructure projects. Represent company externally at conferences, standards bodies, open source projects.
Impact: You shape how the company approaches AI quality. Your infrastructure decisions affect every team. You help set company quality standards and ensure they're met. Executives consult you on quality decisions.
Time to IC5: 36+ months (if continuing to grow). Note: Not all IC4s promote to IC5. Some specialize deeper at IC4.
Differentiator: Thinks strategically about quality. Moves the field forward (publishes papers, speaks at major conferences). Solves hard problems others thought unsolvable.

IC5 / Distinguished Engineer / Principal of Evals (8+ years)

Scope: Company-wide eval strategy. Industry influence.
Expectations: Set long-term eval vision for the company. Shape how the company approaches AI quality for the next 3-5 years. Mentor IC4s and engineering managers. Publish significant research. Speak at major conferences. Participate in standards bodies (ISO, NIST, etc.), helping set industry direction. Make architectural decisions that affect the entire eval ecosystem. Directly influence product strategy through eval insights.
Impact: You are the company's authority on AI quality. Your insights shape strategy. The industry knows your name and your work.
Differentiator: Recognized expert globally. Changes how people think about AI evaluation. Mentees become leaders in their own right.

Typical Promotion Criteria

Beyond level expectations, promotions usually require: sustained performance at the next level before the title catches up (typically two or more review cycles), documented impact (design docs, shipped infrastructure, measurable quality improvements), sponsorship from your manager and senior engineers who will vouch for your work in calibration, and organizational need for another engineer at that level.

Potential Career Tracks at Staff Level

Once you hit IC4, you have options: stay on the specialist IC track and grow toward IC5/Distinguished, or move into engineering management and build out an eval organization.

Most engineers choose one path and stick with it. It's hard to do both specialist and management tracks simultaneously.

Compensation & Equity Benchmarks (2024-2025)

Compensation varies by company stage, geography, experience, and negotiating skill. Here are typical ranges for the US (Bay Area, New York) in early 2025. These are market rates, not guarantees. Your actual offer depends on negotiation and exact circumstances.

Startup Stage (Series A-C, $10M-$100M funding)

Startups offer higher equity percentages but lower salary. The equity is often worth less than FAANG equity (higher failure risk), but the upside is also much higher (potential 10-100x return if successful). Check secondary-market pricing (via platforms like Forge or EquityZen, or Carta's data) to see what shares are trading at and estimate current value.

Growth Stage (Series D-F, $100M-$5B valuation)

Better salary than startups, meaningful but typically smaller equity packages than early-stage. Less risk than startups but still meaningful upside. Many growth-stage companies have gone public or been acquired, so secondary markets exist for some companies.

FAANG & Late-Stage Public (>$500B market cap)

FAANG offers the highest total compensation, especially at senior levels. Stock packages are large and vest over 4 years. Signing bonuses are meaningful. Benefits are excellent (healthcare, 401k match, gym membership, childcare support, relocation).

Comparison: an IC2 at a startup might get $180K salary plus 0.1% equity (worth $1M if the company reaches a $1B valuation, and possibly $0). An IC2 at FAANG might get $240K salary plus $150K/year in stock ($600K vesting over 4 years), roughly $390K/year in near-certain total compensation. Startups have higher upside but much higher risk.

Geographic Variations

San Francisco / Silicon Valley: Add 20-30% to listed ranges. Seattle, Boston, NYC: Add 10-20%. Austin, Denver, other tech hubs: Add 5-10%. Remote (except major hubs): Subtract 10-20%. International: European salaries typically 20-30% lower; Asian tech hubs (Singapore, Hong Kong) similar to US but with higher tax burden.

Equity Considerations: How to Evaluate Offers

Evaluate equity offers carefully. Understanding equity is crucial because at early-stage companies, equity is often worth more than salary.

Key terms to understand: vesting schedule (typically 4 years with a 1-year cliff), strike price (what you pay to exercise options), 409A valuation (the appraised fair market value used to price options), preferred vs. common stock (investors usually hold preferred, employees common), dilution (your percentage shrinks with each funding round), and the post-termination exercise window (often only 90 days).

Rough valuation: Early-stage equity (pre-Series A) is highly risky but has huge upside. Expected value calculation: 10% chance of 10x return, 30% chance of 3x, 40% chance of 1x (exit at same valuation), 20% chance of 0x (failure). Expected value = 0.1*10 + 0.3*3 + 0.4*1 + 0.2*0 = 2.3x. So 0.1% equity worth $100K might have expected value of $230K. But it might be worth $1M or $0. Very uncertain.
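The expected-value arithmetic above generalizes to any scenario table; a tiny sketch:

```python
def expected_multiple(scenarios):
    """Expected return multiple from (probability, multiple) scenario pairs."""
    total_p = sum(p for p, _ in scenarios)
    assert abs(total_p - 1.0) < 1e-9, "scenario probabilities must sum to 1"
    return sum(p * m for p, m in scenarios)

# The scenario table from the text: 2.3x expected multiple,
# so $100K of equity has an expected value of $230K
ev = expected_multiple([(0.10, 10), (0.30, 3), (0.40, 1), (0.20, 0)])
```

Swap in your own probability estimates per offer; the point is the structure of the calculation, not these particular numbers.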

Later-stage equity is more predictable but smaller. Series E company might have 80% chance of good exit. Expected value of 0.05% equity might be 0.8 * $500K = $400K.

Use these as rough estimates. Get a Carta or Pulley valuation for more precision.

Bonus & Benefits

Beyond salary and equity, most tech companies offer: an annual performance bonus (often 10-20% of salary), 401k matching, healthcare, paid time off, and stipends for learning, wellness, or home office setup.

Negotiate: signing bonus (usually 10-25% of first-year salary), relocation (if applicable), home office stipend, professional development budget, flexible work arrangement, stock refresh (if staying multiple years, negotiate new grants to offset dilution).

Negotiation Tactics That Work

Most engineers leave 10-20% of compensation on the table by not negotiating. Here's what works:

Remember: recruiters and hiring managers negotiate all the time. They expect it. A modest counter-offer shows you've done your homework and respect your own value. Extreme asks (2x their offer) signal you're not serious or you don't understand the market.

Key Takeaways

Eval engineering is infrastructure engineering applied to quality measurement: master the four pillars (storage, compute, APIs, observability). Go deep on Python, statistics, and orchestration tools; familiarity with Docker, Kubernetes, SQL, and cloud platforms rounds out the stack. Practice the system design patterns above, since they recur in both interviews and production. Centralize tooling (libraries, CLIs, dashboards) instead of letting each team reimplement. And treat your career with the same rigor you bring to your evals: know the ladder, evaluate equity quantitatively, and negotiate.

Ready to Build Better Evals?

Whether you're starting an eval engineering function or advancing your skills to IC4+, the field needs strong technical leaders. Join the community of engineers building AI quality infrastructure at scale.
