SkillsBench 1.1: Agent Skills Benchmark Release


SkillsBench 1.1 updates the Agent Skills benchmark to 87 native BenchFlow task.md packages across 8 domains. The paper reports 18 model-harness configurations; the public leaderboard currently tracks 24 in total including previous results. In the paper aggregate, curated Skills raise mean resolution rate from 33.9% to 50.5% (+16.6 points).

SkillsBench 1.1 overview: 87 tasks, 8 domains, 18 model-harness configurations, four harnesses, with-Skills resolution rate by model release date, per-configuration "Skill Lift", and skill-invocation rates.

SkillsBench 1.1 is the current release of the benchmark for evaluating how AI agents use Agent Skills: structured packages of instructions, scripts, and reference material mounted at inference time.

The v1.1 paper, SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks, is available on arXiv. The current arXiv version reports an inventory of 87 tasks across 8 domains, paired evaluation with and without curated Skills, and aggregate results over 18 model-harness configurations. We are honored to have Prof. Dawn Song join us as an advising author.

The benchmark task set is pinned as GitHub release v1.1, mirrored on Hugging Face, and accompanied by public trajectories in the SkillsBench leaderboard dataset.

Agent architecture stack and resolution rates across 18 model-harness configurations
The agent stack and the paper aggregate: resolution rates across 18 model-harness configurations on the 87-task SkillsBench 1.1 roster. Bars show the no-Skills baseline plus curated-Skills lift; markers show 95% confidence intervals. OH = OpenHands, CC = Claude Code, GCLI = Gemini CLI.

Release Contents

  • 87 native BenchFlow tasks. The v1.1 roster is packaged as native task.md tasks with environment/, oracle/, and verifier/ directories.
  • 8 domains. The task taxonomy covers Software Engineering, Industrial & Physical Systems, Natural Science, Office & White Collar, Finance & Economics, Mathematics & OR, Cybersecurity, and Media & Content Production.
  • Four harnesses in the paper aggregate. The paper reports Claude Code, Codex, Gemini CLI, and OpenHands.
  • 18 paper configurations and 24 public leaderboard configurations. The paper aggregate covers 18 model-harness configurations. The live leaderboard currently tracks 24 in total including previous results on the same 87-task roster.
  • Skill-invocation tracking. With-Skills runs record whether an agent reads or invokes the task-specific Skills it is given.
  • BenchFlow compatibility. The release experiments run on BenchFlow as the evaluation harness.
  • Additional distribution surfaces. SkillsBench 1.1 is also available through Prime Intellect, AgentBeats, and Harbor.

Credential-dependent or integration-incompatible packages remain under tasks-extra/ and are excluded from the default benchmark roster.
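
For contributors preparing a package, the layout described in the list above can be sanity-checked before submission. The sketch below is illustrative and is not part of BenchFlow; the helper name `missing_entries` is hypothetical, and treating all four entries as mandatory is an assumption drawn from the release contents.

```python
from pathlib import Path

# Entries named in the release notes; treating all of them as mandatory is an assumption.
REQUIRED_FILES = ("task.md",)
REQUIRED_DIRS = ("environment", "oracle", "verifier")


def missing_entries(task_dir: Path) -> list[str]:
    """Return the required files and directories that are absent from task_dir."""
    missing = [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (task_dir / name).is_dir()]
    return missing
```

An empty return value means the package has the expected top-level shape; it says nothing about whether the contents are valid.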

Method

SkillsBench uses paired evaluation. The same task is run in the same container under matched no-Skills and curated-Skills conditions, so "Skill Lift" is measured at the task and configuration level.
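
A minimal sketch of how a per-task "Skill Lift" could be computed from paired trials. The tuple layout and the condition labels `no_skills` and `with_skills` are hypothetical, not the schema of the released trajectory dataset.

```python
from collections import defaultdict
from statistics import mean


def skill_lift(trials):
    """Per-task lift in percentage points: with-Skills rate minus no-Skills rate.

    `trials` is an iterable of (task_id, condition, resolved) tuples, where
    condition is "no_skills" or "with_skills" (hypothetical labels).
    """
    outcomes = defaultdict(lambda: {"no_skills": [], "with_skills": []})
    for task_id, condition, resolved in trials:
        outcomes[task_id][condition].append(1.0 if resolved else 0.0)

    lifts = {}
    for task_id, by_condition in outcomes.items():
        # Only tasks with trials in both conditions form a valid pair.
        if by_condition["no_skills"] and by_condition["with_skills"]:
            lifts[task_id] = 100.0 * (
                mean(by_condition["with_skills"]) - mean(by_condition["no_skills"])
            )
    return lifts
```

Averaging these per-task values over the roster would give a configuration-level lift, assuming every task contributes equally.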

Three-phase SkillsBench construction and evaluation pipeline
Construction and evaluation pipeline. Phase 1 combines contributor task submissions with a 2,014,000-Skill ecosystem snapshot. Phase 2 applies automated checks and human review. Phase 3 runs matched no-Skills and curated-Skills conditions with 3 trials per task across 18 configurations, producing 9,396 scored trajectories.
Paired evaluation design with no-Skills, curated-Skills, and self-generated conditions
Task structure and conditions. Each task has an instruction, deterministic tests, and a withheld oracle. Condition A runs instruction-only. Condition B mounts the expert-authored Skills bundle; the instruction does not name the Skills. Condition C, used for the self-generated-Skills analysis, lets the agent author its own Skills.
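
The 9,396 figure in the pipeline caption is consistent with a full grid of tasks, configurations, trials, and the two paired conditions. A quick check, assuming every cell of that grid yields exactly one scored trajectory and that Condition C runs are counted separately:

```python
TASKS = 87            # v1.1 roster
CONFIGURATIONS = 18   # model-harness configurations in the paper aggregate
TRIALS_PER_TASK = 3
CONDITIONS = 2        # no-Skills and curated-Skills; Condition C assumed excluded

scored_trajectories = TASKS * CONFIGURATIONS * TRIALS_PER_TASK * CONDITIONS
print(scored_trajectories)  # 9396
```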

Task Suite

SkillsBench task distribution across domains and difficulty
87 tasks across 8 domains and 3 difficulty tiers: Core (< 60 min, 6 tasks), Extended (1-4 h, 53 tasks), and Extreme (> 4 h, 28 tasks).

Results

With-Skills Resolution Rate Over Time

The release-date chart shows the with-Skills resolution rate rising from 36.2% for Claude Code + Sonnet 4.5 to 67.3% for OpenHands + GPT-5.5. Using the plotted release-month annotations, the fitted increase is about +1.9 points per month.

On Claude Code, the model-generation sequence is:

| Claude Code model | No Skills (%) | With Skills (%) |
|---|---|---|
| Haiku 4.5 | 8.8 | 30.1 |
| Sonnet 4.5 | 16.7 | 36.2 |
| Opus 4.5 | 23.8 | 49.0 |
| Opus 4.6 | 33.7 | 50.2 |
| Opus 4.7 | 43.0 | 61.2 |

The highest with-Skills result in the current public leaderboard is OpenHands + GPT-5.5 at 67.3%.

Curated Skills Across Configurations

In the 18-configuration paper aggregate, all configurations have higher resolution rates with curated Skills. The mean resolution rate rises from 33.9% to 50.5% (+16.6 points; 25.5% normalized gain), with configuration-level gains from +4.1 to +25.7 points.

| Harness + Model | No Skills (%) | With Skills (%) | "Skill Lift" (points) |
|---|---|---|---|
| OpenHands · GPT-5.5 | 51.5 | 67.3 | +15.8 |
| Codex · GPT-5.5 | 46.8 | 66.5 | +19.7 |
| Claude Code · Opus 4.7 | 43.0 | 61.2 | +18.2 |
| Gemini CLI · Gemini 3.1 Pro | 36.0 | 60.8 | +24.8 |
| OpenHands · GLM 5.1 | 32.7 | 58.4 | +25.7 |
| OpenHands · Opus 4.8 | 45.7 | 54.1 | +8.4 |
| OpenHands · MiniMax M3 | 29.7 | 53.0 | +23.3 |
| OpenHands · DeepSeek V4 Pro | 26.9 | 50.1 | +23.2 |

The largest lifts in the public leaderboard table are OpenHands + GLM 5.1 (+25.7), Gemini CLI + Gemini 3.1 Pro (+24.8), OpenHands + MiniMax M3 (+23.3), and OpenHands + DeepSeek V4 Pro (+23.2).
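
The ranking above can be reproduced from the table with a few lines of Python. This is only a sketch over the eight rows shown here, not the leaderboard's own recomputation; the name `ranked_by_lift` is hypothetical.

```python
# (harness, model, no-Skills %, with-Skills %) copied from the table above.
ROWS = [
    ("OpenHands", "GPT-5.5", 51.5, 67.3),
    ("Codex", "GPT-5.5", 46.8, 66.5),
    ("Claude Code", "Opus 4.7", 43.0, 61.2),
    ("Gemini CLI", "Gemini 3.1 Pro", 36.0, 60.8),
    ("OpenHands", "GLM 5.1", 32.7, 58.4),
    ("OpenHands", "Opus 4.8", 45.7, 54.1),
    ("OpenHands", "MiniMax M3", 29.7, 53.0),
    ("OpenHands", "DeepSeek V4 Pro", 26.9, 50.1),
]


def ranked_by_lift(rows):
    """Return (label, lift) pairs sorted by lift in points, largest first."""
    lifts = [
        (f"{harness} + {model}", round(with_skills - no_skills, 1))
        for harness, model, no_skills, with_skills in rows
    ]
    return sorted(lifts, key=lambda pair: pair[1], reverse=True)


for label, lift in ranked_by_lift(ROWS)[:4]:
    print(f"{label}: +{lift:.1f}")
# OpenHands + GLM 5.1: +25.7
# Gemini CLI + Gemini 3.1 Pro: +24.8
# OpenHands + MiniMax M3: +23.3
# OpenHands + DeepSeek V4 Pro: +23.2
```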

The live leaderboard is recomputed from the public Hugging Face trajectory dataset. It currently covers 24 configurations in total including previous results, with a with-Skills mean of 49.2%.

Cross-Model Comparisons

Selected comparisons in the public leaderboard:

| Comparison | No-Skills reference (%) | With-Skills result (%) |
|---|---|---|
| GLM 5.1 with Skills vs. Opus 4.8 without Skills | OpenHands · Opus 4.8: 45.7 | OpenHands · GLM 5.1: 58.4 |
| MiniMax M2.7 with Skills vs. GLM 5.1 without Skills | OpenHands · GLM 5.1: 32.7 | OpenHands · MiniMax M2.7: 34.9 |
| MiniMax M2.7 with Skills vs. MiniMax M3 without Skills | OpenHands · MiniMax M3: 29.7 | OpenHands · MiniMax M2.7: 34.9 |

"Skill Lift" By Domain

"Skill Lift" is positive in all 8 domains in the paper aggregate.

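| Domain | N (tasks) | No Skills (%) | With Skills (%) | "Skill Lift" (points) |
|---|---|---|---|---|
| Natural Science | 14 | 42.0 | 70.8 | +28.8 |
| Media & Content Production | 5 | 23.3 | 47.4 | +24.1 |
| Cybersecurity | 7 | 29.5 | 48.4 | +18.9 |
| Industrial & Physical Systems | 14 | 23.9 | 39.6 | +15.7 |
| Finance & Economics | 9 | 19.1 | 33.3 | +14.2 |
| Office & White Collar | 14 | 40.5 | 53.0 | +12.6 |
| Software Engineering | 16 | 37.6 | 49.2 | +11.6 |
| Mathematics & OR | 8 | 45.7 | 55.4 | +9.7 |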

Across tasks, 13 of 87 tasks have negative "Skill Lift" in the paper aggregate.
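
As a consistency check, weighting each domain row by its task count recovers the aggregate means quoted earlier (33.9% and 50.5%). Whether the paper computes its aggregate this way is an assumption; the sketch only shows that the rounded domain figures agree with it.

```python
# (domain, task count, no-Skills %, with-Skills %) copied from the domain table.
DOMAINS = [
    ("Natural Science", 14, 42.0, 70.8),
    ("Media & Content Production", 5, 23.3, 47.4),
    ("Cybersecurity", 7, 29.5, 48.4),
    ("Industrial & Physical Systems", 14, 23.9, 39.6),
    ("Finance & Economics", 9, 19.1, 33.3),
    ("Office & White Collar", 14, 40.5, 53.0),
    ("Software Engineering", 16, 37.6, 49.2),
    ("Mathematics & OR", 8, 45.7, 55.4),
]


def task_weighted_mean(rows, column):
    """Mean of rows[column], weighted by the task count stored in column 1."""
    total_tasks = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_tasks


no_skills = task_weighted_mean(DOMAINS, 2)
with_skills = task_weighted_mean(DOMAINS, 3)
print(f"{no_skills:.1f} -> {with_skills:.1f} (+{with_skills - no_skills:.1f})")
# 33.9 -> 50.5 (+16.6)
```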

Skill Invocation Rate

Skill Invocation Rate is the share of with-Skills trials in which the agent reads or invokes the task-specific Skills it is given.
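
The definition translates directly into a small helper. This is a sketch only: the field names `read_skill` and `invoked_skill` are hypothetical and do not reflect the released trajectory schema.

```python
def skill_invocation_rate(trials):
    """Percentage of with-Skills trials with at least one Skill read or invocation.

    Each trial is a dict with boolean keys "read_skill" and "invoked_skill"
    (hypothetical field names).
    """
    trials = list(trials)
    if not trials:
        raise ValueError("no with-Skills trials supplied")
    used = sum(1 for trial in trials if trial["read_skill"] or trial["invoked_skill"])
    return 100.0 * used / len(trials)
```

Note that the rate is independent of whether the trial resolved the task, which is why a failed run can still count as an invocation.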

Skill invocation rate alongside resolution rate across 18 configurations
Resolution rate (solid bars, left axis) and Skill Invocation Rate (hatched bars, right axis) for 18 configurations. The count includes task-specific Skills; harness-bundled Skills are treated as part of the harness.

Invocation rates range from 46% to 99%. Codex + GPT-5.5 is at 99%, OpenHands + GPT-5.5 at 92%, Gemini CLI + Gemini 3.1 Pro at 90%, and OpenHands + Sonnet 4.6 at 89%. Some failed runs still include a recorded Skill invocation.

Self-Generated Vs. Curated Skills

In the self-generated condition, the agent authors its own Skills before solving. The paper reports this condition on three dedicated-harness configurations.

No Skills, self-generated Skills, and curated Skills across three configurations. In these runs, self-generated Skills are below the no-Skills baseline; curated Skills are above it.

Self-generated Skills change resolution rate by -8.1 points for Claude Code + Opus 4.7, -11.3 points for Codex + GPT-5.5, and -11.5 points for Gemini CLI + Gemini 3.1 Pro. Curated Skills add +18.2 to +24.8 points on the same configurations.

Skill Quantity And Length

"Skill Lift" varies by the number and length of Skills attached to a task.

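| Skill quantity | Lift (points) |
|---|---|
| 1 Skill | +18.0 |
| 2-3 Skills | +19.0 |
| 4+ Skills | +10.1 |

| Skill length | Lift (points) |
|---|---|
| Compact | +19.0 |
| Standard | +21.5 |
| Detailed | +14.5 |
| Comprehensive | +0.7 |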

Resolution Rate And Agent Time

The paper also reports mean agent wall-clock time per task.

Per-family shift from no Skills (hollow) to curated Skills (solid) in the time-performance plane. Gray points show the remaining fleet.

Across the fleet, curated Skills add +16.6 points to the mean resolution rate. Mean agent wall-clock per task changes from 14.5 minutes without Skills to 13.8 minutes with Skills in the public leaderboard snapshot.

Resolution rate vs. mean agent wall-clock per task, without Skills and with curated Skills. Dashed lines mark fleet means.

Citation Count

An internal June 2026 count across Google Scholar, Semantic Scholar, and arXiv found about 130 citations of SkillsBench over roughly four months. Citation counts are approximate and change over time.

Acknowledgements

The v1.1 paper and release include contributions from the SkillsBench author and contributor group listed on arXiv. The launch materials acknowledge support from Google DeepMind, Kaggle, OpenHands, Daytona, Prime Intellect, KDense, Vals AI, and other infrastructure and evaluation partners.

Resources

SkillsBench and its evaluation infrastructure are open source under the Apache 2.0 license. Contributions of tasks, Skill sets, and harnesses are accepted through the project repository.

Cite this work

@article{li2026skillsbench,
  title={SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks},
  author={Li, Xiangyi and Liu, Yimin and Chen, Wenbo and You, Bingran and Di, Zonglin and He, Yifeng and Zheng, Shenghan and Choe, Kyoung Whan and Sun, Jiankai and Wang, Shuyi and Tao, Chujun and Li, Binxu and Zhao, Xuandong and Geng, Hejia and Wu, Xiaojun and Zhou, Junwei and Chen, Xiaokun and Xing, Hanwen and Li, Yubo and Zeng, Qunhong and Wang, Di and Wang, Yuanli and Chaim, Roey Ben and Jiang, Penghao and Shen, Haotian and Kong, Luyang and Liu, Xinyi and Wang, Runhui and Liu, Xuanqing and Li, Jiachen and Lan, Xin and Lin, Yueqian and Ye, Wengao and He, Junwei and Li, Songlin and Zhang, Yue and Gao, Yipeng and Li, Yijiang and Ma, Ze and Jing, Liqiang and Wang, Tianyu and Li, Kaixin and Xue, Yiqi and Lyu, Haoran and He, Yizhuo and Tian, Yuchen and Wu, Shutong and Wang, Bowei and Gao, Yixuan and Chen, Bo and Liu, Litong and Cheng, Sikai and Bao, Jiajun and Tong, Shuaicheng and Xu, Shuwen and Zhuo, Terry Yue and Ye, Tinghan and Qi, Qi and Li, Miao and Liao, Longtai and Tan, Zelin and Shi, Chang and Tang, Xilin and Tankasala, Srinath and Yuan, Boqin and Qian, Yaoyao and Tu, Jianhong and Wang, Chenguang and Sun, Yizhou and Wang, Wei and Taylor, Aaron and Yang, Ziyue and Guan, Changkun and Dong, Zhikang and Zhang, Xinyu and Dillmann, Steven and Lee, Han-chung and Song, Dawn},
  journal={arXiv preprint arXiv:2602.12670},
  year={2026}
}