Resolution rates for 24 model–harness configurations, with and without Skills, on the 87-task SkillsBench suite (3 trials per task, max reasoning effort).
dataset: skillsbench@1.1 · all trajectory results on Hugging Face
Resolution rate vs. mean agent wall-clock time per task (log scale, faster to the right). Hover a point for exact values; the no-Skills counterparts are ghosted for context. OpenHands is the default baseline harness; configs run under another harness are labelled with it.
SkillsBench resolution rate vs. model release date, with one dot per model–harness config. Newer models trend up and to the right. Release months are approximate editorial estimates (paper-reported where available). OpenHands is the default baseline harness; configs run under another harness are labelled with it.
Resolution rates across 24 agent–model configurations on SkillsBench (87 tasks, up to 3 trials per task).
dataset: skillsbench@1.1 (v1.1, 87 tasks, registry.json) · recomputed 2026-06-16
| # | Model | Harness | With Skills |
|---|---|---|---|
| 1 | GPT-5.5 | OpenHands | 67.3% |
| 2 | GPT-5.5 | Codex | 66.5% |
| 3 | Opus 4.7 | Claude Code | 61.2% |
| 4 | Gemini 3.1 Pro | Gemini CLI | 60.8% |
| 5 | GLM 5.1 | OpenHands | 58.4% |
| 6 | Gemini 3 Flash | Gemini CLI | 54.6% |
| 7 | Opus 4.8 | OpenHands | 54.1% |
| 8 | Kimi K2.6 | OpenHands | 54.0% |
| 9 | Opus 4.7 | OpenHands | 53.1% |
| 10 | MiniMax M3 | OpenHands | 53.0% |
| 11 | Gemini 3.1 Pro | OpenHands | 52.8% |
| 12 | GPT-5.2 | Codex | 51.7% |
| 13 | Opus 4.6 | Claude Code | 50.2% |
| 14 | DeepSeek V4 Pro | OpenHands | 50.1% |
| 15 | Opus 4.5 | Claude Code | 49.0% |
| 16 | Gemini 3.5 Flash | OpenHands | 48.2% |
| 17 | Sonnet 4.6 | OpenHands | 47.2% |
| 18 | DeepSeek V4 Flash | OpenHands | 44.7% |
| 19 | Grok 4.3 | OpenHands | 41.7% |
| 20 | GPT-5.4 Mini | OpenHands | 41.4% |
| 21 | Sonnet 4.5 | Claude Code | 36.2% |
| 22 | MiniMax M2.7 | OpenHands | 34.9% |
| 23 | Haiku 4.5 | Claude Code | 30.1% |
| 24 | Gemini 3.1 Flash Lite | OpenHands | 20.1% |
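The page does not spell out how per-trial outcomes are aggregated into a single resolution rate. As a minimal sketch, assuming each task's trials (up to 3) are averaged first and the suite score is the unweighted mean over tasks, the computation could look like the following. The function name `resolution_rate`, the `(task_id, resolved)` record shape, and the example task IDs are hypothetical, not taken from the SkillsBench tooling.

```python
from collections import defaultdict


def resolution_rate(trials):
    """Return a resolution rate in percent.

    `trials` is an iterable of (task_id, resolved) pairs, one per trial.
    Assumption: trials are averaged per task, then tasks are averaged
    with equal weight, so a task with fewer trials still counts once.
    """
    per_task = defaultdict(list)
    for task_id, resolved in trials:
        per_task[task_id].append(1.0 if resolved else 0.0)
    if not per_task:
        raise ValueError("no trials supplied")
    task_means = [sum(v) / len(v) for v in per_task.values()]
    return 100.0 * sum(task_means) / len(task_means)


# Hypothetical trial records for three made-up tasks.
example = [
    ("task-a", True), ("task-a", True), ("task-a", False),
    ("task-b", False), ("task-b", True),
    ("task-c", True),
]
print(f"{resolution_rate(example):.1f}%")  # prints "72.2%"
```

Averaging per task first keeps a task with only one or two completed trials from being under-weighted relative to tasks with all three; pooling all trials together would be a different, equally plausible convention.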
Resolution rate across the eight professional domains of the 87-task taxonomy. Hover a radar axis to inspect that domain; compare up to 4 agents.
solid = without Skills · pale = Skill lift · hover another radar axis to switch domain
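The legend above implies that each radar value splits into a solid part (the no-Skills rate) and a pale part (the lift from Skills). A small sketch of that arithmetic, assuming lift is the simple difference in percentage points; the function name `skill_lift_segments` and the domain names and numbers in the example are hypothetical placeholders, not SkillsBench results.

```python
def skill_lift_segments(with_skills, without_skills):
    """Split per-domain rates into (solid, pale) segments.

    Both arguments map a domain name to a resolution rate in percent.
    solid = rate without Skills; pale = with-Skills rate minus that,
    in percentage points (negative if Skills hurt in that domain).
    """
    segments = {}
    for domain, rate_with in with_skills.items():
        rate_without = without_skills[domain]
        segments[domain] = (rate_without, round(rate_with - rate_without, 1))
    return segments


# Hypothetical values for two made-up domains.
with_s = {"domain-x": 60.0, "domain-y": 40.0}
without_s = {"domain-x": 45.5, "domain-y": 42.5}
print(skill_lift_segments(with_s, without_s))
```

A negative pale value marks a domain where the no-Skills run scored higher; how the chart itself renders that case is not stated on the page.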