InterfaceGym flagship benchmark

ComponentBench diagnoses where computer-use agents fail inside modern UI components.

Long-horizon web benchmarks report only whether a workflow succeeded. ComponentBench instead isolates the interaction unit, links failures to canonical components, and compares agents against human steps and human time across observation/action interfaces.

Full tasks: 2,910
Core tasks: 912
Canonical types: 97
Families: 14

Live Research Snapshot

Best v1 pass rate: 95.2% (gemini-3-flash / Browser-Use)
Mean human effort: 2.8 steps, 4.9s mean task time
Mean agent time gap: 23.5x (task-level agent / human time)
HF summaries: 32 (92 result files indexed)
Results are generated from local analysis tables plus a downloaded Hugging Face results manifest. Missing run videos and frame blobs remain explicitly marked as pending backend integration.
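
The "task-level agent / human time" label reads as a mean of per-task ratios, which generally differs from dividing mean agent time by mean human time. A minimal sketch of that distinction follows. The function names and timings are illustrative assumptions, not ComponentBench data or its analysis code.

```python
from statistics import mean

def mean_task_level_gap(agent_s, human_s):
    """Mean of per-task agent/human time ratios (assumed reading of the metric)."""
    return mean(a / h for a, h in zip(agent_s, human_s))

def ratio_of_means(agent_s, human_s):
    """Mean agent time divided by mean human time, shown for contrast."""
    return mean(agent_s) / mean(human_s)

# Hypothetical per-task timings in seconds (not real benchmark results).
agent = [40.0, 60.0, 30.0]
human = [2.0, 10.0, 3.0]

task_level = mean_task_level_gap(agent, human)  # ratios are 20, 6, 10
aggregate = ratio_of_means(agent, human)        # 130 / 15

print(f"mean of task-level ratios: {task_level:.1f}x")
print(f"ratio of means:            {aggregate:.2f}x")
```

Tasks that humans finish very quickly dominate the task-level mean, which is one reason a task-level gap can be much larger than a comparison of average times would suggest.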

What ComponentBench Measures

1. Create component-level tasks
Each task centers one canonical UI component while retaining realistic carrier context such as drawers, tables, modals, and panels.
2. Record human reference
Human steps and wall-clock time become the efficiency baseline, not just an annotation artifact.
3. Run agents through interfaces
Browser-Use, AX-tree, SoM, Pixel, and related modes expose how observation/action surfaces change the outcome.
4. Debug with logs
Every model run should be traceable back to task metadata, verifier state, observations, actions, frames, and failure category.
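
One way to picture the traceability requirement in the last step is a run record that can report which trace parts are absent. This is a hedged sketch only: `RunRecord`, `Step`, their field names, and `missing_trace_parts` are hypothetical and do not describe the actual ComponentBench log schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Step:
    observation: str
    action: str
    frame_path: Optional[str] = None  # None when no frame is stored for this step

@dataclass
class RunRecord:
    # All field names are assumptions made for illustration.
    task_id: str
    model: str
    interface: str
    passed: bool
    task_metadata: dict = field(default_factory=dict)
    verifier_state: dict = field(default_factory=dict)
    steps: list = field(default_factory=list)
    failure_category: Optional[str] = None

    def missing_trace_parts(self) -> list:
        """List the trace elements this record cannot be traced back to."""
        missing = []
        if not self.task_metadata:
            missing.append("task_metadata")
        if not self.verifier_state:
            missing.append("verifier_state")
        if not self.steps:
            missing.append("steps")
        elif any(s.frame_path is None for s in self.steps):
            missing.append("frames")
        if not self.passed and self.failure_category is None:
            missing.append("failure_category")
        return missing

record = RunRecord(
    task_id="example-task",
    model="example-model",
    interface="AX-tree",
    passed=False,
    task_metadata={"canonical_type": "select native"},
    steps=[Step(observation="listbox collapsed", action="click")],
)
print(record.missing_trace_parts())
```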

Why component-level evaluation matters

A five-step workflow with 80% reliability on each component succeeds only about one third of the time (0.8^5 ≈ 0.33, assuming steps fail independently). ComponentBench makes the failing primitive inspectable: selection, overlay opening, text editing, table manipulation, drag/drop, precise sliders, date/time input, and more.
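
The compounding behind that figure can be checked directly. The sketch below assumes each step succeeds independently with the same probability, which is a simplification. The function name is illustrative.

```python
def workflow_success_rate(step_reliability: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps."""
    if not 0.0 <= step_reliability <= 1.0:
        raise ValueError("step_reliability must be in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return step_reliability ** n_steps

rate = workflow_success_rate(0.8, 5)
print(f"{rate:.3f}")  # 0.8 ** 5 = 0.32768, roughly one third
```

Under the same assumption, raising per-step reliability to 95% lifts the five-step rate to about 77%, which is why isolating the weakest component matters more than the aggregate workflow score.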

Root cause isolation

Tasks are grouped by canonical component and family.

Human efficiency baseline

Steps and time establish the minimum useful comparison.

Interface dependence

Same task, different observation/action mode, different failure profile.

Results Grounded In Real Runs

Top v1 model/mode rows from `topline_model_mode.csv`.

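| Model | Interface | Pass | Avg steps | Avg time | Tasks |
| --- | --- | --- | --- | --- | --- |
| gemini-3-flash | Browser-Use | 95.2% | 4.8 | 46.2s | 2910 |
| gemini-3-flash | AX-tree | 89.6% | 4.3 | 48.5s | 2910 |
| gemini-3.1-flash-lite | Browser-Use | 87.4% | 6.0 | 31.5s | 2910 |
| gemini-3-flash | SoM | 87.1% | 4.7 | 49.3s | 2910 |
| gpt-5-mini | Browser-Use | 87.0% | 5.8 | 64.3s | 2910 |
| gpt-5.4-mini | Browser-Use | 85.8% | 5.2 | 26.1s | 2910 |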

Hard Families

Drag/Drop & Workspace Interactions: 48.4%
Continuous & High-Precision Input: 58.1%
Advanced Editors: 60.1%
Date & Time: 69.9%
Disclosure & Progressive: 71.3%

Mode-Sensitive Components

Avg absolute mode delta, by component:
select native: 51.0%
kanban board drag drop: 49.3%
drag drop sortable list: 40.0%
rich text editor: 37.5%
drag drop between lists: 34.3%
context menu: 28.0%
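
The page does not define "avg absolute mode delta". One plausible reading, assumed here, is the mean absolute difference in pass rate over every pair of interface modes for a component. The sketch below implements only that assumed definition. The function name and the pass rates are made up for illustration.

```python
from itertools import combinations
from statistics import mean

def avg_abs_mode_delta(pass_rate_by_mode: dict) -> float:
    """Mean |pass_i - pass_j| over all mode pairs (assumed definition)."""
    pairs = list(combinations(pass_rate_by_mode.values(), 2))
    if not pairs:
        raise ValueError("need pass rates for at least two modes")
    return mean(abs(a - b) for a, b in pairs)

# Hypothetical pass rates (percent) for one component, not benchmark data.
rates = {"Browser-Use": 90.0, "AX-tree": 40.0, "SoM": 30.0}

delta = avg_abs_mode_delta(rates)  # pair gaps are 50, 60, 10
print(f"{delta:.1f}")
```

Under this reading, a large value means the component's outcome depends heavily on which observation/action surface the agent is given, independent of the model.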

InterfaceGym Task Lab Pipeline

The site now presents Task Lab as a benchmark production pipeline. Some stages are implemented; moderation queues, smoke-test orchestration, and publishing controls are marked as backend work.

Stage 1: idea
Stage 2: ontology/component selection
Stage 3: structured task spec
Stage 4: page/component generation
Stage 5: verifier/success checker
Stage 6: human trajectory recording
Stage 7: agent smoke test
Stage 8: moderation/approval
Stage 9: publish benchmark pool
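
The nine stages form a strict sequence, which could be modeled as an ordered enum. This is a sketch under assumptions: the `Stage` names and `next_stage` helper are hypothetical, and mapping the stated backend work (smoke-test orchestration, moderation queues, publishing controls) onto stages 7 to 9 is an interpretation of the description above, not a confirmed status.

```python
from enum import IntEnum
from typing import Optional

class Stage(IntEnum):
    IDEA = 1
    COMPONENT_SELECTION = 2
    TASK_SPEC = 3
    PAGE_GENERATION = 4
    VERIFIER = 5
    HUMAN_TRAJECTORY = 6
    AGENT_SMOKE_TEST = 7
    MODERATION = 8
    PUBLISH = 9

# Assumed mapping of "marked as backend work" onto stages.
PENDING_BACKEND = {Stage.AGENT_SMOKE_TEST, Stage.MODERATION, Stage.PUBLISH}

def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the following stage, or None after the final one."""
    return Stage(stage + 1) if stage < Stage.PUBLISH else None

for stage in Stage:
    status = "pending backend" if stage in PENDING_BACKEND else "no backend flag"
    print(f"Stage {stage.value}: {stage.name.lower()} ({status})")
```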