
Agent Leaderboard

Performance benchmarks of AI agents on SkillsBench (84 tasks, 5 trials per task). Tasks that encountered runtime errors during evaluation were excluded.

Agent        Model           No Skills (%)  Self-Gen (%)  With Skills (%)
Gemini CLI   Gemini 3 Flash  31.3           -             48.7
Claude Code  Opus 4.5        22.0           21.6          45.3
Codex        GPT-5.2         30.6           25.0          44.7
Claude Code  Opus 4.6        30.6           32.0          44.5
Gemini CLI   Gemini 3 Pro    27.6           -             41.2
Claude Code  Sonnet 4.5      17.3           15.2          31.8
Claude Code  Haiku 4.5       11.0           11.0          27.7
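
The gain from skills for each agent is the With Skills score minus the No Skills score, in percentage points. The following is a minimal Python sketch of that arithmetic using the figures listed above. It rests on two assumptions: the three values per entry follow the legend order (No Skills, Self-Gen, With Skills), and for the Gemini CLI entries, which list only two values, the first is the No Skills score and no Self-Gen score was reported. The names `Row`, `uplift`, and `ranked_by_uplift` are hypothetical and are not part of SkillsBench.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    agent: str
    model: str
    no_skills: float           # score without skills, in percent
    self_gen: Optional[float]  # self-generated skills; None if not listed
    with_skills: float         # score with skills, in percent


# Figures copied from the leaderboard. The column assignment for the
# two-value Gemini CLI entries is an assumption (see the text above).
ROWS: List[Row] = [
    Row("Gemini CLI", "Gemini 3 Flash", 31.3, None, 48.7),
    Row("Claude Code", "Opus 4.5", 22.0, 21.6, 45.3),
    Row("Codex", "GPT-5.2", 30.6, 25.0, 44.7),
    Row("Claude Code", "Opus 4.6", 30.6, 32.0, 44.5),
    Row("Gemini CLI", "Gemini 3 Pro", 27.6, None, 41.2),
    Row("Claude Code", "Sonnet 4.5", 17.3, 15.2, 31.8),
    Row("Claude Code", "Haiku 4.5", 11.0, 11.0, 27.7),
]


def uplift(row: Row) -> float:
    """With Skills minus No Skills, in percentage points (one decimal)."""
    return round(row.with_skills - row.no_skills, 1)


def ranked_by_uplift(rows: List[Row]) -> List[Row]:
    """Return rows sorted from largest to smallest uplift."""
    return sorted(rows, key=uplift, reverse=True)


if __name__ == "__main__":
    for row in ranked_by_uplift(ROWS):
        print(f"{row.agent} / {row.model}: +{uplift(row):.1f} pp")
```

Sorting by this difference rather than by the With Skills score gives a different order from the leaderboard, since an agent with a low baseline can gain more from skills than one with a higher final score.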