ComponentBench diagnoses where computer-use agents fail inside modern UI components.
Long-horizon web benchmarks report whether a workflow succeeded. ComponentBench isolates the interaction unit, links failures to canonical components, and compares agents against human steps and human time across observation/action interfaces.
Live Research Snapshot
What ComponentBench Measures
Why component-level evaluation matters
A five-step workflow in which each component interaction succeeds 80% of the time completes only about one third of the time (0.8^5 ≈ 0.33, assuming the steps fail independently). ComponentBench makes the failing primitive inspectable: selection, overlay opening, text editing, table manipulation, drag and drop, precise sliders, date/time input, and more.
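The compounding effect can be checked with a short calculation. This is a minimal sketch that assumes every step has the same reliability and that steps fail independently; the function name `workflow_success_rate` is illustrative and not part of ComponentBench.

```python
def workflow_success_rate(component_reliability: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps
    that share the same per-step reliability (a simplifying assumption)."""
    if not 0.0 <= component_reliability <= 1.0:
        raise ValueError("component_reliability must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return component_reliability ** num_steps


# Five steps at 80% reliability each.
rate = workflow_success_rate(0.80, 5)
print(f"{rate:.3f}")  # 0.328
```

Under the same assumptions, raising per-step reliability to 95% lifts the five-step rate to roughly 77%, which is why small component-level gains matter for long workflows.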
Tasks are grouped by canonical component and family.
Human steps and human time set the baseline each agent is compared against.
Same task, different observation/action mode, different failure profile.
Results Grounded In Real Runs
Top v1 model and interface-mode rows from `topline_model_mode.csv`, ranked by pass rate.
| Model | Interface | Pass rate | Avg. steps | Avg. time | Tasks |
|---|---|---|---|---|---|
| gemini-3-flash | Browser-Use | 95.2% | 4.8 | 46.2s | 2910 |
| gemini-3-flash | AX-tree | 89.6% | 4.3 | 48.5s | 2910 |
| gemini-3.1-flash-lite | Browser-Use | 87.4% | 6.0 | 31.5s | 2910 |
| gemini-3-flash | SoM | 87.1% | 4.7 | 49.3s | 2910 |
| gpt-5-mini | Browser-Use | 87.0% | 5.8 | 64.3s | 2910 |
| gpt-5.4-mini | Browser-Use | 85.8% | 5.2 | 26.1s | 2910 |
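To show how rows like these can be compared across interfaces, the sketch below parses the six rows above and computes the gap between a model's best and worst interface. The column names (`model`, `interface`, `pass_rate`, and so on) and the helper names are assumptions made for illustration; the actual schema of `topline_model_mode.csv` is not shown here.

```python
import csv
import io

# Rows copied from the table above. The header names are hypothetical,
# not the real schema of topline_model_mode.csv.
SAMPLE_CSV = """model,interface,pass_rate,avg_steps,avg_time_s,tasks
gemini-3-flash,Browser-Use,95.2,4.8,46.2,2910
gemini-3-flash,AX-tree,89.6,4.3,48.5,2910
gemini-3.1-flash-lite,Browser-Use,87.4,6.0,31.5,2910
gemini-3-flash,SoM,87.1,4.7,49.3,2910
gpt-5-mini,Browser-Use,87.0,5.8,64.3,2910
gpt-5.4-mini,Browser-Use,85.8,5.2,26.1,2910
"""


def load_rows(text: str) -> list[dict]:
    """Parse CSV text into dicts with numeric fields converted."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "model": record["model"],
            "interface": record["interface"],
            "pass_rate": float(record["pass_rate"]),
            "avg_steps": float(record["avg_steps"]),
            "avg_time_s": float(record["avg_time_s"]),
            "tasks": int(record["tasks"]),
        })
    return rows


def mode_spread(rows: list[dict], model: str) -> float:
    """Best minus worst pass rate (percentage points) across interfaces."""
    rates = [r["pass_rate"] for r in rows if r["model"] == model]
    if not rates:
        raise ValueError(f"no rows for model {model!r}")
    return max(rates) - min(rates)


rows = load_rows(SAMPLE_CSV)
spread = mode_spread(rows, "gemini-3-flash")
print(f"gemini-3-flash spread across interfaces: {spread:.1f} points")
```

For gemini-3-flash, the only model with more than one interface in these six rows, the spread is the difference between Browser-Use (95.2%) and SoM (87.1%).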
Hard Families
Mode-Sensitive Components
InterfaceGym Task Lab Pipeline
The site now presents Task Lab as a benchmark production pipeline. Some stages are implemented; moderation queues, smoke-test orchestration, and publishing controls are marked as backend work.