JobBench: Aligning Agent Work with Human Will

Measuring agents by GDP alone asks how much of a human's job can be taken away.

JobBench asks how much of that job can be given back — built on the work that experts across real-world professions actually want delegated to AI.

Current leader: GPT-5.4 (OpenAI · via Codex CLI)
Weighted score: 37.2%

In collaboration with

University of Washington
UC Santa Barbara
Stanford University
Carnegie Mellon University
University of Notre Dame
IBM Research
BakeAI
Michigan State University
UC Berkeley
Northwestern University
University of Chicago
§ 01 — Why Human Will

Economics alone is not enough.

The conversation about AI in the workplace has been framed almost entirely in economic terms: What fraction of working hours can agents absorb? How much of GDP is exposed to automation? Benchmarks like OpenAI's GDPval inherit this framing by design: they select tasks that represent economic value and score agents on whether they can deliver the professional knowledge output those tasks call for.

We believe this framing, on its own, is not enough.

If agents are going to share the professional workplace with humans, the question is not only what work is most economically valuable to automate, but also what work the humans in that role actually want automated. This is a humanist problem. It treats the professional not as labor to be displaced but as a collaborator whose judgment about their own craft matters, and it is the premise JobBench is built on.

The economic question: GDPval (OpenAI)
“What fraction of a human's job is economically valuable to automate?”

The humanist question: JobBench (ours)
“What work do the humans in that role actually want automated?”

§ 02 — Rankings

Model leaderboard

Rank  Model               Score
1     GPT-5.4             37.2
2     Claude Sonnet 4.6   36.3
3     Claude Opus 4.6     35.4
4     GPT-5.2             33.6
5     GPT-5.3 Codex       33.1
6     Claude Opus 4.5     31.0
7     Claude Sonnet 4.5   26.8
8     GPT-5.1 Codex       26.2
9     GPT-5.2 Codex       24.8
10    Claude Opus 4       20.9
11    Claude Sonnet 4     17.9
12    Qwen 3.5 Plus       17.6
13    Claude Haiku 4.5    15.2
14    MiniMax M2.5        14.2
15    Gemini 3 Pro        10.9
16    Gemini 3 Flash      10.8
17    Kimi K2.5           8.6
18    Grok 4.2 Fast       4.2

Score = weighted rubric score across all evaluated tasks.
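
The page does not spell out how the weighting works, so the following is only a minimal sketch of one plausible reading. It assumes each task has rubric criteria with a numeric weight and a pass/fail outcome, that a task's score is the weight-share of criteria met, and that the benchmark score is the plain mean over tasks. The names `Criterion`, `task_score`, and `benchmark_score` are hypothetical and do not come from JobBench.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric criterion (hypothetical structure, not JobBench's schema)."""
    weight: float
    met: bool


def task_score(criteria):
    """Percentage of rubric weight earned on one task (assumed definition)."""
    total = sum(c.weight for c in criteria)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in criteria if c.met)
    return 100.0 * earned / total


def benchmark_score(tasks):
    """Unweighted mean of per-task scores (assumed aggregation)."""
    if not tasks:
        return 0.0
    return sum(task_score(t) for t in tasks) / len(tasks)


# Illustrative data only: two made-up tasks, not real JobBench results.
example_tasks = [
    [Criterion(weight=3, met=True), Criterion(weight=1, met=False)],
    [Criterion(weight=1, met=True), Criterion(weight=1, met=False)],
]
print(benchmark_score(example_tasks))
```

Other aggregations, such as pooling all criteria across tasks before dividing, would give different numbers; the sketch only shows the general shape of a weighted rubric score.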

§ 03 — Headroom

Far from saturation

GPT-5.4: GDPval 83.0 (saturating) vs. JobBench 37.2 (63 pts headroom)

Model           GDPval   JobBench
GPT-5.2 Codex   70.9     24.8
GPT-5.3 Codex   70.9     33.1
GPT-5.4         83.0     37.2

Workload: JobBench over GDPval
Wall-clock per task   1.64×
Tool calls per task   1.33×
Trajectory lines      1.30×
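
The headroom figure can be checked with a few lines of arithmetic. The sketch below assumes headroom means the distance from a score to a 100-point ceiling, which matches the 63-point figure quoted for GPT-5.4. The scores are the ones listed above; the `headroom` helper is a hypothetical name.

```python
# (GDPval, JobBench) scores as listed in this section.
scores = {
    "GPT-5.2 Codex": (70.9, 24.8),
    "GPT-5.3 Codex": (70.9, 33.1),
    "GPT-5.4": (83.0, 37.2),
}


def headroom(score, ceiling=100.0):
    """Points remaining below the ceiling (assumed definition of headroom)."""
    return round(ceiling - score, 1)


for model, (gdpval, jobbench) in scores.items():
    print(f"{model}: GDPval headroom {headroom(gdpval)}, "
          f"JobBench headroom {headroom(jobbench)}")
```

For GPT-5.4 this gives 17.0 points of headroom on GDPval and 62.8 on JobBench, which rounds to the 63 points shown.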
§ 04 — Methodology

From knowledge delivery to professional reasoning

§ 05 — Inside a task

What the agent is actually up against

Every JobBench task is a small dossier. The example below shows one role in detail.

Role

Reporter — Connecticut investigative desk

Automation desire: 4.00/5

Lead in Connecticut drinking water. The state says zero water hazards. The FOIA data says otherwise.

6 sources · 4 types · 3 contradictions
Source flow: heterogeneous inputs

Multiple Hartford-area systems exceed the 15 ppb federal action level.
  conflicts with CT_2024_Surveillance_Report: FOIA exceedances vs. 0% home-hazard finding
  conflicts with EPA_LCRI_Factsheet: Rule finalized vs. current enforcement cycle

0% of investigated homes identified water as a lead hazard.
  conflicts with FOIA_water_data: FOIA exceedances vs. 0% home-hazard finding

CT rows only for 2017–2019; 2020–2022 are dagger-marked non-submissions.
  conflicts with martinez_interview: CDC n=1,666 vs. Martinez 30% clinic-specific

10 ppb action level finalized Oct 2024; not yet enforceable.
  conflicts with FOIA_water_data: Rule finalized vs. current enforcement cycle

Pediatric referrals up 30% post-threshold change (Dr. Martinez).
  conflicts with CDC_2017_2022_Blood_Lead: CDC n=1,666 vs. Martinez 30% clinic-specific

Waterbury 16.1 ppb vs. Newark 47.9 ppb: trajectory, not point-in-time.

Agent: reasoning over reporter sources
Deliverables
  • Thesis-driven pitch memo
  • 3-sheet data workbook
  • 15+ entry source verification log
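
The "3 contradictions" count can be reproduced from the "conflicts with" entries in the source flow. The sketch below transcribes each entry as a (target source, description) pair. It assumes that every contradiction is listed once from each side and can be identified by its description; that is a reading of the page, not a documented JobBench format, and the function names are hypothetical.

```python
# (target source, description) pairs transcribed from the "conflicts with" entries.
conflict_mentions = [
    ("CT_2024_Surveillance_Report", "FOIA exceedances vs. 0% home-hazard finding"),
    ("EPA_LCRI_Factsheet", "Rule finalized vs. current enforcement cycle"),
    ("FOIA_water_data", "FOIA exceedances vs. 0% home-hazard finding"),
    ("martinez_interview", "CDC n=1,666 vs. Martinez 30% clinic-specific"),
    ("FOIA_water_data", "Rule finalized vs. current enforcement cycle"),
    ("CDC_2017_2022_Blood_Lead", "CDC n=1,666 vs. Martinez 30% clinic-specific"),
]


def count_contradictions(mentions):
    """Count distinct contradictions, assuming the description identifies each one."""
    return len({description for _, description in mentions})


def sources_in_conflict(mentions):
    """Sorted names of sources that appear as a conflict target."""
    return sorted({target for target, _ in mentions})


print(count_contradictions(conflict_mentions))
print(sources_in_conflict(conflict_mentions))
```

Under that assumption the six entries collapse to three distinct contradictions, matching the dossier summary. Five named sources take part in them; the sixth input, the Waterbury vs. Newark comparison, has no listed conflict.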

Reasoning challenges by design

§ 06 — Breakdown

Heatmap

Scale: 0–10% | 10–20% | 20–30% | 30–40% | 40%+
Rows: occupation. Columns, left to right (model, overall score):
GPT-5.4 (37.2)
Sonnet 4.6 (36.3)
Opus 4.6 (35.4)
GPT-5.2 (33.6)
GPT-5.3 Codex (33.1)
Opus 4.5 (31.0)
Sonnet 4.5 (26.8)
GPT-5.1 Codex (26.2)
GPT-5.2 Codex (24.8)
Opus 4 (20.9)
Sonnet 4 (17.9)
Qwen 3.5 Plus (17.6)
Haiku 4.5 (15.2)
MiniMax M2.5 (14.2)
Gemini 3 Pro (10.9)
Gemini 3 Flash (10.8)
Kimi K2.5 (8.6)
Grok 4.2 Fast (4.2)
Business / Financial Ops
Bookkeeping & Accounting Clerks1923511743130191714449414900
HR Specialists563147883419191941090000090
Licensing Examiners / Inspectors5033331717423317833333317331725420
Management Analysts263018272413166013000310330
Personal Financial Advisors334182336182110103110080231000
Purchasing Agents25434724343927211833161618871122
Training & Development Specialists3841342030423016303630221818161404
Avg.3535343131262115182315111010121081
Office / Admin Support
Court Clerks373237453747024211113110001300
Customer Service Reps215029291650816298816021021160
Data Entry Keyers5966555861543947512036282632221779
Medical Secretaries5123413815158154180151588880
Police / Fire Dispatchers36473636361547472615574719301111150
Secretaries & Admin Assistants723046464837202011204122306011206
Avg.4641414235362028301426231516713112
Computer / Mathematical
Biostatisticians29251220461837572812282225281511119
CS Researchers16381911221220898141514400411
Statisticians361844363437362622301514141414874
User Support Specialists39365748334538283219293922261212250
Web Administrators52483624242440121224121212121224120
Avg.34333428322734262119192017171111125
Architecture / Engineering
Civil Engineers5355525135433649423018222624182536
Mechanical Eng. Technicians24322020192927251551215129146153
Mechanical Engineers3627522705218189000009990
Petroleum Engineers1228360161228322012012122001200
Avg.313540251834273121127121213101372
Management
Financial Managers1459442433142624329181018491504
Health Services Managers20332026819208819141488141484
IT / IS Managers411736492724121715171510101581200
Supply Chain Managers1712176121212061706006066
Avg.2330292620171712151512109791033
Arts / Media
Producers536442645339397264283131222280014
Reporters & Correspondents472020374733233320431010130100010
Technical Writers55645045494554373435424141301211279
Avg.5249374849393948393527272617104911
Other (Legal · Sales · Science · Edu.)
Lawyers50253825252550253801325131300250
Online Merchants63324543433021593038203014251420149
Securities Sales Agents3535142759272704124110000000
Soc. Sci. Research Assistants585247595955374952272920242918201413
Sociology Teachers (Postsec.)57292836364534334135111714211715147
Tech & Sci. Sales Reps12101330181161110136191063000
Avg.463131374032292935231519121699115