This post documents a small but complete post-training loop for an agent model: collect environment trajectories, convert them into SFT data, train a LoRA adapter, host the adapter, and evaluate it through the same agent harness.
The result is intentionally modest. A 300-trajectory SFT run improved
Qwen/Qwen3.5-9B on the original 300-task denominator from 4/300 to
16/300. On the final Fireworks-hosted standard60 eval, the SFT model scored
24/180 against a baseline of 19/180, a +2.78 percentage point change.
That is directionally positive, but not a claim of broad generalization. The useful result is that the full loop works end to end and leaves behind reviewable artifacts: task sets, teacher trajectories, SFT rows, training logs, model commits, and eval trajectories.
Result Summary
The final same-host Fireworks comparison used 60 env-0 tasks with 3 trials per task:
| Eval | Baseline | SFT | Delta |
|---|---|---|---|
| Fireworks standard60, 3 trials | 19/180 (10.56%) | 24/180 (13.33%) | +5 passes, +2.78 pp |
The earlier same-denominator mobile eval, using the 300-task set that produced the SFT data, showed a larger movement:
| Eval | Baseline | SFT | Delta |
|---|---|---|---|
| env-0-mobile tasks-eval, 300 rows | 4/300 (1.33%) | 16/300 (5.33%) | +12 passes, +4.00 pp |
Held-out mobile generalization was weaker. On an unseen100 split from
env-0-mobile/tasks-train, baseline was 3/100 and SFT was 2/100.
The headline should therefore be precise:
We validated the SFT pipeline and saw a +2.78 percentage point Fireworks standard60 lift, but the result is not statistically strong and does not yet prove broad held-out generalization.
Artifact Index
GitHub:
- env-0-experiment repository: benchflow-ai/env-0-experiment
- Full technical report: 2026-06-27-qwen35-env0-mobile-sft-technical-report.md
- Final Fireworks report: experiments/fireworks-qwen35-180/EXPERIMENT_REPORT.md
- 300-task SFT reproduction report: experiments/env0-mobile-prime-sft-pr828/REPRODUCTION_REPORT.md
- Task-set registry: task_set_series.json and TASK_SET_SERIES.md
- BenchFlow PR that added the train-data bridge: benchflow-ai/benchflow#828
- env-0 task-set registry commit: 1de52f23aceeb65bc08a01cfad75593910e14c09
Hugging Face:
- Trajectory dataset root: benchflow/env0-experiment-trajectories
- Final Fireworks index: experiments/fireworks-qwen35
- Fireworks baseline 180 trajectories: qwen3p5-9b-standard60-3trials
- Fireworks SFT 180 trajectories: benchflow-qwen35-9b-env0-mobile-sft-live-standard60-3trials
- Fireworks baseline aggregate: summary.json
- Fireworks SFT aggregate: summary.json
- 300-task teacher trajectory set: pr828-env0-mobile-full300-azure-openhands-daytona-20260625T041236Z
- 300-step SFT training artifacts: env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
- 300-task post-SFT eval: pr828-env0-mobile-sft-full300-qwen35-9b-h100-20260625T100226Z
- SFT model repository: benchflow/benchflow-qwen35-9b
- SFT model commit used for final Fireworks verification: 65613510b42644d44fdf06d4b3d31bc3e9f4ef8e
Pipeline
env-0 task-set curation
-> GPT-5.4-mini teacher rollouts with BenchFlow + OpenHands + Daytona
-> BenchFlow PR828 results.jsonl emission
-> bench train convert to Prime SFT JSONL
-> Qwen/Qwen3.5-9B BF16 LoRA SFT, 300 steps
-> adapter publication to benchflow/benchflow-qwen35-9b
-> base/SFT eval on 300-task denominator
-> held-out unseen100 check
-> Fireworks-hosted standard60 3-trial comparison
1. Task Set And Data Boundary
The training-data source was the 300-task env-0-mobile/tasks-eval set. That
set came from generated mobile tasks with reward-1 strong-model trajectories,
then was balanced across six task families: auth, calendar, docs, drive, mail,
and multi-app tasks.
The task-set metadata records env-0 commit
1de52f23aceeb65bc08a01cfad75593910e14c09. The PR828 reproduction report
records the runnable submodule snapshot used during the run as
21e358f7360a9704355556c3d0c8f6466bf5e9c2.
The relevant data boundary:
- tasks-eval: 300 copied task directories used for teacher trajectory collection and the same-denominator SFT comparison.
- tasks-train: 1703 tasks after removing those 300 eval tasks.
- Final Fireworks standard60 eval: env-0/tasks, not the 300-task env-0-mobile/tasks-eval training set.
We separately checked exact task ids. The SFT training task set and the final Fireworks standard60 task set had zero exact task-id overlap. This is not the same as saying the eval is held out by generator family or domain.
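The exact-id check described above amounts to a set intersection. A minimal sketch, assuming each task set is available as a list of task-id strings; the function name and the example ids are hypothetical and not taken from the repository:

```python
def task_id_overlap(train_ids, eval_ids):
    """Return the sorted task ids that appear in both collections."""
    return sorted(set(train_ids) & set(eval_ids))


# Hypothetical ids, for illustration only.
sft_train_ids = ["mobile-auth-001", "mobile-mail-014", "mobile-docs-203"]
standard60_ids = ["std-calendar-07", "std-drive-11"]

overlap = task_id_overlap(sft_train_ids, standard60_ids)
print(f"{len(overlap)} overlapping task ids")
```

An empty result only rules out identical ids. Tasks produced by the same generator under different ids would still pass this check.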
2. Teacher Trajectory Generation
The 300 SFT rows came from real BenchFlow rollouts:
- Teacher: Azure GPT-5.4-mini.
- Agent: OpenHands.
- Sandbox: Daytona.
- Task set: env-0/env-0-mobile/tasks-eval.
- Output contract: Verifiers/Prime-RL-shaped results.jsonl, plus original trajectory artifacts.
The worker-sharded run initially produced 81/300 passes and 5 infra
errors. The five infra-error tasks were refilled, then the run was canonicalized
to one healthy LLM trajectory per task.
Canonical teacher set:
| Metric | Value |
|---|---|
| Canonical pass count | 83/300 |
| results.jsonl rows | 300 |
| prime-sft.jsonl rows | 300 |
| Rows with tool calls | 175/300 |
| Source LLM exchanges | 2163 |
| Skipped conversion rows | 0 |
The SFT data used all 300 teacher trajectories, not just the 83 teacher-pass trajectories. That choice gave broader coverage but also mixed successful and unsuccessful behavior.
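The difference between the two selection policies can be sketched in a few lines. This is an illustration only: the post does not show the results.jsonl schema, so the `task_id` and `reward` keys and both helper functions below are assumptions.

```python
import json


def load_rows(jsonl_text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def select_rows(rows, passed_only=False, reward_key="reward"):
    """Keep every row, or only rows whose reward equals 1."""
    if not passed_only:
        return list(rows)
    return [row for row in rows if row.get(reward_key) == 1]


# Three made-up rows standing in for a results.jsonl file.
sample = """
{"task_id": "t1", "reward": 1}
{"task_id": "t2", "reward": 0}
{"task_id": "t3", "reward": 1}
"""

rows = load_rows(sample)
all_rows = select_rows(rows)
passed_rows = select_rows(rows, passed_only=True)
print(len(all_rows), len(passed_rows))
```

In this run the first policy was used, giving 300 rows. A pass-only filter would have left 83.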
3. BenchFlow PR828 Data Bridge
BenchFlow PR #828 added the bridge from evaluation artifacts to training data:
- job-level results.jsonl aggregation;
- bench train validate;
- bench train convert;
- Prime-SFT export and normalization.
The bridge was validated with a synthetic no-spend row, a one-task Docker oracle sanity check, a 10-task spendful canary, and the full 300-task run. The 10-task canary produced 10 Verifiers-shaped rows, 10 Prime SFT rows, 7 rows with tool calls, and zero skipped rows.
4. SFT Training
The student was Qwen/Qwen3.5-9B, using the full non-prequantized checkpoint
as the source model. The training update was adapter-based BF16 LoRA rather
than full-parameter AdamW.
Training run:
env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
Configuration:
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-9B |
| Base precision | BF16 |
| Base loading | Full, non-quantized |
| Sequence length | 8192 |
| Train rows | 300 |
| Training steps | 300 |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| Batch size | 1 |
| Gradient accumulation | 8 |
| Learning rate | 1e-4 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
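A few quantities follow from the table by arithmetic. The sketch below assumes a single device and that "training steps" counts optimizer steps, neither of which the report states; if steps were micro-batches instead, the run would cover roughly one pass over the data rather than eight. The class name is hypothetical and is not the trainer's actual config object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SftRunConfig:
    """Values copied from the configuration table."""
    train_rows: int = 300
    training_steps: int = 300
    batch_size: int = 1
    grad_accumulation: int = 8
    lora_rank: int = 32
    lora_alpha: int = 64

    @property
    def sequences_per_step(self) -> int:
        # Assumes one device: one optimizer step consumes batch_size * grad_accumulation rows.
        return self.batch_size * self.grad_accumulation

    @property
    def sequences_seen(self) -> int:
        return self.sequences_per_step * self.training_steps

    @property
    def approx_epochs(self) -> float:
        return self.sequences_seen / self.train_rows

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank


cfg = SftRunConfig()
print(cfg.sequences_per_step, cfg.sequences_seen, cfg.approx_epochs, cfg.lora_scaling)
```

Under those assumptions the run sees 2400 sequences, about eight passes over the 300 rows, with a LoRA scaling factor of 2.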
An A100 40GB feasibility check OOMed before training. The accepted run used a Prime Intellect H100 80GB instance under the BenchFlow team context.
Training completed all 300 steps:
| Metric | Value |
|---|---|
| Completed steps | 300/300 |
| Best step | 300 |
| Best eval loss | 0.4590291380882263 |
| Saved checkpoints | 100, 200, 300 |
The final adapter was published to benchflow/benchflow-qwen35-9b.
5. Same-Denominator 300-Task Eval
The trained adapter was first evaluated on the same 300-task denominator used for teacher trajectory collection.
Baseline:
- Model: official Qwen/Qwen3.5-9B.
- Serving: SGLang on Lambda A100.
- Result: 4/300 pass, 217/300 rows with tool calls.
SFT:
- Model: Qwen3.5 SFT adapter.
- Serving: SGLang runtime LoRA on H100.
- Result: 16/300 pass, 215/300 rows with tool calls.
Comparison:
| Stage | Pass | Pass rate | Rows with tool calls |
|---|---|---|---|
| Base | 4/300 | 1.33% | 217/300 |
| SFT | 16/300 | 5.33% | 215/300 |
On this seen denominator, the adapter added 12 passes and 4.00 percentage
points. On the 83 tasks solved by GPT-5.4-mini, base was 3/83 and SFT was
13/83.
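The split between teacher-solved and teacher-failed tasks can be derived by subtraction from the figures quoted above. The counts for the 217 teacher-failed tasks are not reported separately in the source; the sketch below only works them out from the totals.

```python
TOTAL_TASKS = 300
TEACHER_SOLVED = 83

# (passes on all 300 tasks, passes on the 83 teacher-solved tasks), from the text above.
results = {"base": (4, 3), "sft": (16, 13)}

teacher_failed = TOTAL_TASKS - TEACHER_SOLVED  # 217 tasks

breakdown = {}
for name, (total_pass, solved_pass) in results.items():
    failed_pass = total_pass - solved_pass
    breakdown[name] = {
        "teacher_solved": (solved_pass, TEACHER_SOLVED),
        "teacher_failed": (failed_pass, teacher_failed),
    }
    print(f"{name}: {solved_pass}/{TEACHER_SOLVED} on teacher-solved, "
          f"{failed_pass}/{teacher_failed} on teacher-failed")
```

By this arithmetic, 10 of the 12 added passes fall on tasks the teacher itself solved.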
6. Held-Out Unseen100 Check
A held-out 100-task split was selected from env-0-mobile/tasks-train, after
excluding all 300 task ids from the canonical SFT denominator.
| Stage | Pass | Pass rate |
|---|---|---|
| Base | 3/100 | 3.00% |
| SFT | 2/100 | 2.00% |
This is why the 300-task result should not be presented as broad generalization. The adapter improved the seen denominator, but the held-out mobile split did not improve.
7. Fireworks Standard60 Eval
The final comparison moved to Fireworks managed deployments:
- Baseline deployment: openai/accounts/bingran-you/deployments/env0-qwen3p5-9b-standard60
- SFT deployment: openai/accounts/bingran-you/deployments/benchflow-qwen35-9b-env0-mobile-sft-live
- Benchmark: env-0 standard60 (env-0/tasks).
- Agent: OpenHands.
- Sandbox: Daytona.
- Rollout count: 60 tasks x 3 trials = 180 rows per model.
- Skill mode: with-skill.
Final aggregate:
| Model | Strict pass | Pass rate | Rows with tools | Unscored |
|---|---|---|---|---|
| Fireworks Qwen3.5 baseline | 19/180 | 10.56% | 175/180 | 0 |
| Fireworks BenchFlow Qwen3.5 SFT | 24/180 | 13.33% | 180/180 | 0 |
Trial breakdown:
| Model | Trial | Pass |
|---|---|---|
| Baseline | trial-01-20260626T165651Z | 5/60 |
| Baseline | trial-02-20260627T055735Z | 7/60 |
| Baseline | trial-03-20260627T072748Z-proxy | 7/60 |
| SFT | trial-01-20260627T022446Z | 8/60 |
| SFT | trial-02-20260627T082221Z | 6/60 |
| SFT | trial-03-20260627T090839Z | 10/60 |
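To see how much the per-trial counts move around, the table can be summarized with the standard library. This is a rough descriptive sketch, not part of the original analysis. With only three trials per model, the spread is an indication rather than an estimate.

```python
from statistics import mean, stdev

TASKS_PER_TRIAL = 60

# Per-trial pass counts copied from the table above.
trial_passes = {
    "baseline": [5, 7, 7],
    "sft": [8, 6, 10],
}

summary = {}
for model, passes in trial_passes.items():
    summary[model] = {
        "total": sum(passes),
        "mean_rate": mean(passes) / TASKS_PER_TRIAL,
        "stdev_passes": stdev(passes),  # sample standard deviation across trials
        "range": (min(passes), max(passes)),
    }
    print(model, summary[model])
```

The per-trial ranges overlap (5 to 7 for the baseline, 6 to 10 for SFT), which is consistent with treating the aggregate lift cautiously.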
The strict pass delta:
24/180 - 19/180 = +5/180
= +2.78 percentage points
A rough two-proportion check gives p ≈ 0.42, with a 95% confidence interval
for the delta of about [-3.92 pp, +9.47 pp]. The result is directionally
positive, operationally validated, and not yet statistically strong.
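One way to reproduce figures of this size is a pooled two-proportion z-test for the p-value and an unpooled (Wald) interval for the delta. The source does not say which exact method was used, so treat this as a sketch under that assumption. It also treats the 180 rows as independent draws, which is a simplification because each of the 60 tasks appears three times.

```python
from math import erf, sqrt


def two_proportion_check(pass_a, n_a, pass_b, n_b, z_crit=1.96):
    """Two-sided pooled z-test and unpooled (Wald) 95% interval for p_b - p_a."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    delta = p_b - p_a

    # Pooled standard error for the test of "no difference".
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = delta / se_pooled
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)

    # Unpooled standard error for the interval around the observed delta.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (delta - z_crit * se_unpooled, delta + z_crit * se_unpooled)
    return delta, p_value, ci


delta, p_value, (ci_low, ci_high) = two_proportion_check(19, 180, 24, 180)
print(f"delta = {delta * 100:+.2f} pp, p = {p_value:.2f}, "
      f"95% CI = [{ci_low * 100:+.2f}, {ci_high * 100:+.2f}] pp")
```

The interval spans zero, which is what "not yet statistically strong" means here.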
What We Learned
This experiment validated several pieces of the training loop:
- BenchFlow can emit trainable, verifier-shaped trajectory records through results.jsonl.
- The PR828 bridge can convert real OpenHands rollouts into Prime-SFT rows without dropping rows.
- A small 300-row SFT set can change Qwen/Qwen3.5-9B behavior on the same denominator.
- The trained adapter can be hosted behind Fireworks and still produce structured tool calls for OpenHands.
- The final 180-row standard60 comparison is directionally positive with zero unscored rows.
The limitation is equally important: 300 mixed-quality teacher trajectories were not enough to show robust held-out transfer.
Next Step
The next credible experiment should keep the proven loop but improve the data:
- Build a larger disjoint train set from env-0-mobile/tasks-train.
- Prefer verifier-selected or action-quality-filtered trajectories over all raw teacher trajectories.
- Preserve exact task-list provenance before spendful runs.
- Train from a fixed Qwen/Qwen3.5-9B source checkpoint.
- Evaluate on both held-out mobile tasks and env-0 standard60.
- Keep Fireworks hosting for the final same-host baseline/SFT comparison.
The current run is a proof that the post-training loop is real. The next run should test whether stronger data selection and a larger denominator turn that loop into a generalizable OpenHands tool-use improvement.