JobBench: Aligning Agent Work with Human Will

Measuring agents by GDP alone asks how much of a human's job can be taken away.

JobBench asks how much of that job can be given back — built on the work that experts across real-world professions actually want delegated to AI.

agent_01

Current leader

Claude Fable 5

Anthropic

Weighted score: 57.4%

JobBench has been adopted by Meta's Muse Spark 1.1 and Moonshot's Kimi K3

§ 01 — Why Human Will

Economics alone is not enough.

agent_01 says

“Let me reconcile. You decide.”

The conversation about AI in the workplace has been framed almost entirely in economic terms: What fraction of working hours can agents absorb? How much of GDP is exposed to automation? Benchmarks like OpenAI's GDPval inherit this framing by design: they select tasks that represent economic value and score agents on whether they can deliver the professional knowledge output.

We believe this framing, on its own, is not enough.

If agents are going to share the professional workplace with humans, the question is not only which work is most economically valuable to automate, but also which work the humans in that role actually want automated. This is a humanist problem. It treats the professional not as labor to be displaced but as a collaborator whose judgment about their own craft matters, and it is the premise JobBench is built on.

The economic question

GDPval

OpenAI

“What fraction of a human's job is economically valuable to automate?”

The humanist question

JobBench

Ours

“What work do the humans in that role actually want automated?”

§ 02 — Rankings

Model leaderboard

Model	Weighted score (%)	Harness
Claude Fable 5	57.4	OpenCode
Muse Spark 1.1	54.7	OpenCode
Kimi K3	54.3	OpenCode
Claude Opus 4.8	48.4	OpenCode
GPT-5.6 SOL	45.4	OpenCode
Claude Opus 4.7	44.5	OpenCode
GLM 5.2	43.4	OpenCode
GPT-5.5	38.3	OpenCode
Claude Sonnet 4.6	36.6	OpenCode
GPT-5.4	32.2	OpenCode
Gemini 3.5 Flash	31.5	OpenCode
GPT-5.2	26.6	OpenCode
Claude Sonnet 4.5	20.7	OpenCode
Gemini 3.1 Pro	15.9	OpenCode

All models run on the same harness, OpenCode v1.14.18, each at its maximum reasoning effort. Grok 4.3 is used as the rubric judge. (Earlier runs used Grok 4.1 Fast, since retired by xAI, so scores may differ slightly from previously reported results.) The Grok judges are chosen mainly for cost: one full eval pass costs ~$2 with Grok 4.1 Fast and ~$20 with Grok 4.3. Claude Fable 5 runs with a fallback to Claude Opus 4.8 on refusals.

§ 03 — Headroom

Far from saturation

GDPval vs. JobBench scores (0–100 scale)

Model	GDPval	JobBench
GPT-5.2 Codex	70.9	24.8
GPT-5.3 Codex	70.9	33.7
GPT-5.4 (Codex CLI)	83.0	38.9

For GPT-5.4 (Codex CLI), GDPval is saturating at 83.0, while JobBench at 38.9 leaves 61 points of headroom.
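The headroom figure is simply the distance from a score to the top of the 0–100 scale. A minimal sketch of that arithmetic, using only the GDPval and JobBench scores listed above (the `headroom` helper and the dictionary layout are illustrative names, not part of JobBench):

```python
# Scores copied from the comparison above as (GDPval, JobBench), both on a 0-100 scale.
scores = {
    "GPT-5.2 Codex": (70.9, 24.8),
    "GPT-5.3 Codex": (70.9, 33.7),
    "GPT-5.4 (Codex CLI)": (83.0, 38.9),
}


def headroom(score, ceiling=100.0):
    """Points remaining between a score and the top of the scale (hypothetical helper)."""
    return round(ceiling - score, 1)


# Headroom per model on each benchmark.
headroom_table = {
    model: {"GDPval": headroom(gdpval), "JobBench": headroom(jobbench)}
    for model, (gdpval, jobbench) in scores.items()
}

for model, row in headroom_table.items():
    print(f"{model}: GDPval headroom {row['GDPval']}, JobBench headroom {row['JobBench']}")
```

For GPT-5.4 (Codex CLI) this gives 61.1 points of JobBench headroom, which the page rounds to 61.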

Workload

JobBench over GDPval

Wall-clock per task: 2.40×
Tool calls per task: 1.40×
Trajectory lines: 1.40×

§ 04 — Methodology

From knowledge delivery to professional reasoning

§ 05 — Inside a task

What the agent is actually up against

Every JobBench task is a small dossier. The example below shows the details for one role.

Role

Reporter — Connecticut investigative desk

Automation desire

4.00/5

Lead in Connecticut drinking water. The state says zero water hazards. The FOIA data says otherwise.

6 sources · 4 types · 3 contradictions

Source flow

Heterogeneous inputs

Multiple Hartford-area systems exceed the 15 ppb federal action level.

conflicts with CT_2024_Surveillance_Report — FOIA exceedances vs. 0% home-hazard finding

conflicts with EPA_LCRI_Factsheet — Rule finalized vs. current enforcement cycle

0% of investigated homes identified water as a lead hazard.

conflicts with FOIA_water_data — FOIA exceedances vs. 0% home-hazard finding

CT rows only for 2017–2019; 2020–2022 are dagger-marked non-submissions.

conflicts with martinez_interview — CDC n=1,666 vs. Martinez 30% clinic-specific

10 ppb action level finalized Oct 2024 — not yet enforceable.

conflicts with FOIA_water_data — Rule finalized vs. current enforcement cycle

Pediatric referrals up 30% post-threshold change (Dr. Martinez).

conflicts with CDC_2017_2022_Blood_Lead — CDC n=1,666 vs. Martinez 30% clinic-specific

Waterbury 16.1 ppb vs. Newark 47.9 ppb — trajectory, not point-in-time.
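One way to read the source flow above is as a small conflict graph: each "conflicts with" line is a directed edge, and each contradiction appears once from each side. The sketch below is only an illustration of that reading, not JobBench's internal format. The `Conflict` class is a hypothetical name, and the owning source of each edge is an assumption inferred from the symmetric conflict descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conflict:
    source: str  # source that makes the claim (assumed, inferred from the symmetric entries)
    other: str   # source named on the "conflicts with" line
    issue: str   # conflict description, copied from the listing


# The six "conflicts with" lines from the source flow, in order.
conflicts = [
    Conflict("FOIA_water_data", "CT_2024_Surveillance_Report",
             "FOIA exceedances vs. 0% home-hazard finding"),
    Conflict("FOIA_water_data", "EPA_LCRI_Factsheet",
             "Rule finalized vs. current enforcement cycle"),
    Conflict("CT_2024_Surveillance_Report", "FOIA_water_data",
             "FOIA exceedances vs. 0% home-hazard finding"),
    Conflict("CDC_2017_2022_Blood_Lead", "martinez_interview",
             "CDC n=1,666 vs. Martinez 30% clinic-specific"),
    Conflict("EPA_LCRI_Factsheet", "FOIA_water_data",
             "Rule finalized vs. current enforcement cycle"),
    Conflict("martinez_interview", "CDC_2017_2022_Blood_Lead",
             "CDC n=1,666 vs. Martinez 30% clinic-specific"),
]


def unique_contradictions(edges):
    """Collapse directed edges into unordered source pairs, keeping one issue per pair."""
    pairs = {}
    for edge in edges:
        pairs[frozenset((edge.source, edge.other))] = edge.issue
    return pairs


contradictions = unique_contradictions(conflicts)
for pair, issue in contradictions.items():
    print(" <-> ".join(sorted(pair)), "|", issue)
```

Collapsing the six directed entries into unordered pairs leaves three distinct contradictions, which agrees with the count in the task summary.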

Agent

reasoning over reporter sources

Deliverables

Thesis-driven pitch memo
3-sheet data workbook
15+ entry source verification log

Reasoning challenges by design

§ 06 — Breakdown

Heatmap

Harness: OpenCode v1.14.18 for all models; judge: Grok 4.3. “–” = the run produced no output there — the agent hit the per-task time limit, or the model refused (Claude Fable 5 cells otherwise include its Opus 4.8 fallback on refusals).

Scale: 0–10%, 10–20%, 20–30%, 30–40%, 40%+

Occupation	Fable 5 (57.4)	Muse Spark 1.1 (54.7)	Kimi K3 (54.3)	Opus 4.8 (48.4)	GPT-5.6 SOL (45.4)	Opus 4.7 (44.5)	GLM 5.2 (43.4)	GPT-5.5 (38.3)	Sonnet 4.6 (36.6)	Gemini 3.5 Flash (31.5)	Sonnet 4.5 (20.7)	Gemini 3.1 Pro (15.9)
Business / Financial Ops
Bookkeeping & Accounting Clerks	77	51	26	–	53	26	0	24	32	27	6	17
HR Specialists	69	9	44	50	38	38	75	28	9	9	0	0
Licensing Examiners / Inspectors	67	78	72	72	69	53	81	50	67	25	8	8
Management Analysts	61	39	38	29	17	44	29	33	26	9	3	6
Personal Financial Advisors	31	31	21	31	46	67	8	10	18	31	0	0
Purchasing Agents	65	61	56	49	48	47	35	35	54	31	20	7
Training & Development Specialists	47	61	57	54	54	49	49	62	26	26	31	7
Avg.	59	47	45	47	46	46	40	35	33	23	10	7
Office / Admin Support
Court Clerks	–	58	55	–	29	37	24	26	42	100	29	18
Customer Service Reps	66	53	50	53	50	63	29	42	29	8	8	0
Data Entry Keyers	89	88	67	70	57	66	72	50	28	59	38	49
Medical Secretaries	54	54	54	54	46	41	41	31	54	41	15	8
Police / Fire Dispatchers	68	36	36	57	36	19	57	28	47	17	36	28
Secretaries & Admin Assistants	52	39	37	38	27	48	31	27	23	5	31	10
Avg.	66	55	50	54	41	46	42	34	37	38	26	19
Computer / Mathematical
Biostatisticians	23	57	51	23	37	43	46	44	25	49	5	20
CS Researchers	35	43	37	20	29	24	18	24	17	9	25	5
Statisticians	43	50	44	50	53	40	53	49	55	48	31	23
User Support Specialists	65	53	57	55	38	48	61	37	15	39	9	44
Web Administrators	36	72	48	36	60	60	72	36	36	24	24	12
Avg.	41	55	47	37	44	43	50	38	29	34	19	21
Architecture / Engineering
Civil Engineers	61	53	58	41	50	49	50	44	45	32	31	31
Mechanical Eng. Technicians	61	47	70	45	48	39	36	35	30	29	20	3
Mechanical Engineers	55	45	67	55	58	36	45	18	0	33	9	0
Petroleum Engineers	52	56	52	52	32	32	28	32	32	12	0	0
Avg.	57	50	62	48	47	39	40	32	27	27	15	9
Management
Financial Managers	67	76	90	55	52	39	33	28	50	43	29	24
Health Services Managers	45	39	33	38	21	32	20	14	34	16	8	14
IT / IS Managers	46	54	48	28	54	43	27	43	30	25	17	10
Supply Chain Managers	29	50	35	5	38	5	18	22	5	5	5	5
Avg.	47	55	51	32	42	30	25	27	30	22	14	13
Arts / Media
Producers	100	64	78	100	83	56	78	78	47	53	19	22
Reporters & Correspondents	73	57	63	73	60	63	63	47	23	33	10	20
Technical Writers	65	66	68	59	45	53	62	44	56	43	55	17
Avg.	79	62	70	77	63	57	68	56	42	43	28	20
Other (Legal · Sales · Science · Edu.)
Lawyers	75	75	75	63	75	63	38	38	50	25	50	13
Online Merchants	67	60	77	69	70	50	59	50	60	35	6	19
Securities Sales Agents	27	27	41	38	35	14	51	27	24	0	14	0
Soc. Sci. Research Assistants	74	71	75	79	70	64	73	70	74	66	46	42
Sociology Teachers (Postsec.)	51	63	52	60	42	54	35	36	39	40	13	15
Tech & Sci. Sales Reps	73	39	63	33	33	44	35	25	14	19	8	11
Avg.	61	56	64	57	54	48	49	41	44	31	23	17
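When working with the heatmap rows programmatically, the "–" cells need to be treated as missing rather than as zero, since they mark runs with no output. A minimal sketch, assuming the rows are available as tab-separated text exactly as shown above (`parse_row` and `mean_of_available` are illustrative helper names, not part of any JobBench tooling):

```python
MISSING = "\u2013"  # the en dash used in the heatmap for "no output"


def parse_row(line):
    """Split one tab-separated heatmap row into (occupation, list of scores or None)."""
    name, *cells = line.split("\t")
    values = [None if cell.strip() == MISSING else int(cell) for cell in cells]
    return name, values


def mean_of_available(values):
    """Average over the cells that have a score; None if every cell is missing."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# The Court Clerks row from the heatmap above.
row = "Court Clerks\t\u2013\t58\t55\t\u2013\t29\t37\t24\t26\t42\t100\t29\t18"
occupation, values = parse_row(row)
missing_count = values.count(None)
row_mean = mean_of_available(values)

print(occupation, "missing cells:", missing_count)
```

The row mean here averages across models for a single occupation. It is a convenience statistic for this sketch only and is not one of the published "Avg." values, which average across occupations within a category.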

Cite

@misc{li2026jobbenchaligningagentwork,
  title         = {JobBench: Aligning Agent Work With Human Will},
  author        = {Yuetai Li and Yichen Feng and Zhangchen Xu and Zixian Ma and others},
  year          = {2026},
  eprint        = {2605.26329},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2605.26329}
}

From knowledge delivery to professional reasoning

Human-will grounded
Professional reasoning, not knowledge delivery
Fact-anchored rubrics
Heterogeneous real-world data

Reasoning challenges by design (Reporter, Connecticut investigative desk)

Reconcile the water-vs-paint contradiction
Fact-check Dr. Martinez's 30% quote
LCRI: finalized vs. enforceable
90th-percentile action-level rule