Agent Leaderboard

Resolution rates for 24 model–harness configurations, with and without Skills, on the 87-task SkillsBench suite (up to 3 trials per task, max reasoning effort).

dataset: skillsbench@1.1 · all trajectory results on Hugging Face

Agent Performance

Resolution rate vs. mean agent wall-clock time per task (log scale, faster to the right). The no-Skills counterparts are ghosted for context. OpenHands is the default baseline harness; configs run under another harness are labelled with it.

Hover a point to pin its exact resolution rate and wall-clock to the axes. Dashed lines mark fleet means.

Capability Over Time

SkillsBench resolution rate vs. model release date — one dot per model–harness config. Newer models trend up and to the right. Release months are approximate editorial estimates (paper-reported where available). OpenHands is the default baseline harness; configs run under another harness are labelled with it.

Hover a point to pin its release month and resolution rate. Dashed line marks the fleet mean; the emerald line is a least-squares fit.
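The least-squares line mentioned above can be sketched as an ordinary least-squares fit of resolution rate on release date. This is a minimal illustration, not the page's actual plotting code: the function name `fit_line` is hypothetical, release dates are assumed to be encoded as a number (for example, months since an arbitrary origin), and the demo points are synthetic values chosen to lie on a known line, not leaderboard data.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept.

    xs: release date encoded numerically (assumption: months since some origin).
    ys: resolution rate in percent.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and the x-y cross term.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic check: points on y = 2x + 1 should recover slope 2 and intercept 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

A positive slope under this encoding corresponds to the "up and to the right" trend described in the caption.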

Agent Leaderboard

Resolution rates across 24 agent–model configurations on SkillsBench (87 tasks, up to 3 trials per task).

dataset: skillsbench@1.1 (v1.1, 87 tasks, registry.json) · recomputed 2026-06-16


#	Model	Harness	Without	With Skills	Δ (pp)	Gain (g)
1	GPT-5.5	OpenHands	51.5%	67.3%	+15.8	32.6%
2	GPT-5.5	Codex	46.8%	66.5%	+19.7	37.0%
3	Opus 4.7	Claude Code	43.0%	61.2%	+18.2	31.9%
4	Gemini 3.1 Pro	Gemini CLI	36.0%	60.8%	+24.8	38.7%
5	GLM 5.1	OpenHands	32.7%	58.4%	+25.7	38.1%
6	Gemini 3 Flash	Gemini CLI	34.2%	54.6%	+20.4	31.0%
7	Opus 4.8	OpenHands	45.7%	54.1%	+8.4	15.5%
8	Kimi K2.6	OpenHands	33.4%	54.0%	+20.6	31.0%
9	Opus 4.7	OpenHands	42.1%	53.1%	+11.1	19.1%
10	MiniMax M3	OpenHands	29.7%	53.0%	+23.3	33.2%
11	Gemini 3.1 Pro	OpenHands	33.8%	52.8%	+19.0	28.7%
12	GPT-5.2	Codex	29.7%	51.7%	+22.0	31.3%
13	Opus 4.6	Claude Code	33.7%	50.2%	+16.5	25.0%
14	DeepSeek V4 Pro	OpenHands	26.9%	50.1%	+23.2	31.8%
15	Opus 4.5	Claude Code	23.8%	49.0%	+25.2	33.1%
16	Gemini 3.5 Flash	OpenHands	41.1%	48.2%	+7.1	12.1%
17	Sonnet 4.6	OpenHands	33.5%	47.2%	+13.6	20.5%
18	DeepSeek V4 Flash	OpenHands	27.5%	44.7%	+17.2	23.7%
19	Grok 4.3	OpenHands	22.8%	41.7%	+18.8	24.4%
20	GPT-5.4 Mini	OpenHands	29.9%	41.4%	+11.5	16.4%
21	Sonnet 4.5	Claude Code	16.7%	36.2%	+19.5	23.4%
22	MiniMax M2.7	OpenHands	18.1%	34.9%	+16.8	20.5%
23	Haiku 4.5	Claude Code	8.8%	30.1%	+21.3	23.4%
24	Gemini 3.1 Flash Lite	OpenHands	16.0%	20.1%	+4.1	4.9%
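The page does not state how Gain (g) is computed. The published values are consistent with a normalized gain, the share of the remaining headroom (100% minus the Without rate) that Skills close: g = (With − Without) / (100 − Without). The sketch below assumes that formula; the function name `normalized_gain` is hypothetical, and the three rows are copied from the table above.

```python
def normalized_gain(without_pct: float, with_pct: float) -> float:
    """Percent of the remaining headroom (100 - without) closed by Skills.

    Assumed formula, inferred from the published table rather than stated by it.
    """
    headroom = 100.0 - without_pct
    if headroom <= 0:
        raise ValueError("no headroom left: the Without rate is already 100%")
    return 100.0 * (with_pct - without_pct) / headroom


# (model, harness, Without %, With Skills %) copied from rows 1, 2, and 24.
rows = [
    ("GPT-5.5", "OpenHands", 51.5, 67.3),
    ("GPT-5.5", "Codex", 46.8, 66.5),
    ("Gemini 3.1 Flash Lite", "OpenHands", 16.0, 20.1),
]

for model, harness, without, with_skills in rows:
    g = normalized_gain(without, with_skills)
    print(f"{model} ({harness}): g = {g:.1f}%")
# GPT-5.5 (OpenHands): g = 32.6%
# GPT-5.5 (Codex): g = 37.0%
# Gemini 3.1 Flash Lite (OpenHands): g = 4.9%
```

Because the table shows rounded rates, recomputing Δ or g from the displayed columns can differ from the published figure by about 0.1 for some rows.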

Hover over a row to see confidence intervals and normalized gain.

skillsbench@1.1 · 87 tasks · up to 3 trials per task · 95% CIs · Without vs. With Skills

Providers: OpenAI, Anthropic, Google, Z.ai, Moonshot, MiniMax, DeepSeek, xAI

Professional-Domain Profile

Resolution rate across the eight professional domains of the 87-task taxonomy. Hover a radar axis to inspect that domain; compare up to 4 agents.

Software Engineering (16 tasks)

Model	Harness	Without	With Skills
GPT-5.5	OpenHands	52.6%	63.4%
Opus 4.8	OpenHands	55.1%	65.9%
Gemini 3.5 Flash	OpenHands	46.4%	50.9%

Radar legend: solid = without Skills · pale = Skill lift · rings at 20–100% · with Skills. Agents shown: GPT-5.5 (OpenHands), Opus 4.8 (OpenHands), Gemini 3.5 Flash (OpenHands).