InterfaceGym flagship benchmark

ComponentBench diagnoses where computer-use agents fail inside modern UI components.

Long-horizon web benchmarks report whether a workflow succeeded. ComponentBench isolates the interaction unit, links failures to canonical components, and compares agents against human steps and human time across observation/action interfaces.

2,910

Full tasks

912

Core tasks

Canonical types

Families

Live Research Snapshot

Best v1 pass rate: 95.2% (gemini-3-flash / Browser-Use)

Mean human effort: 2.8 steps, 4.9s mean task time

Mean agent time gap: 23.5x (task-level agent / human time)

HF summaries: 92 result files indexed

Results are generated from local analysis tables plus a downloaded Hugging Face results manifest. Missing run videos and frame blobs remain explicitly marked as pending backend integration.
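The time gap in the snapshot is labelled as a task-level agent / human time ratio. The page does not spell out the formula, so the following is only a minimal sketch that assumes the gap is the mean of per-task ratios. The timings and function names are made up for illustration and are not benchmark data.

```python
from statistics import mean

# Hypothetical per-task timings in seconds (illustrative values only).
human_times = [2.0, 4.0, 10.0]
agent_times = [60.0, 40.0, 50.0]


def mean_task_level_gap(agent, human):
    """Average of per-task agent/human time ratios (assumed definition)."""
    if len(agent) != len(human):
        raise ValueError("agent and human timings must cover the same tasks")
    return mean(a / h for a, h in zip(agent, human))


def ratio_of_means(agent, human):
    """Ratio of mean agent time to mean human time, shown for contrast."""
    return mean(agent) / mean(human)


task_gap = mean_task_level_gap(agent_times, human_times)
aggregate_gap = ratio_of_means(agent_times, human_times)
print(f"mean task-level gap: {task_gap:.1f}x")
print(f"ratio of means: {aggregate_gap:.1f}x")
```

With these made-up numbers the task-level average is 15.0x while the ratio of means is about 9.4x, because tasks a human finishes in a couple of seconds dominate the per-task average. The two aggregates should not be compared directly.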

What ComponentBench Measures

Create component-level tasks

Each task centers one canonical UI component while retaining realistic carrier context such as drawers, tables, modals, and panels.

Record human reference

Human steps and wall-clock time become the efficiency baseline, not just an annotation artifact.

Run agents through interfaces

Browser-Use, AX-tree, SoM, Pixel, and related modes expose how observation/action surfaces change the outcome.

Debug with logs

Every model run should be traceable back to task metadata, verifier state, observations, actions, frames, and failure category.
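One way to picture that traceability requirement is a single record per run that carries every item listed. This is a hedged sketch, not the project's actual log schema: the class name, field names, and the `missing_trace_parts` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RunRecord:
    """One agent run on one task. Every field name here is hypothetical."""
    task_id: str
    model: str
    interface: str                      # e.g. "Browser-Use", "AX-tree", "SoM", "Pixel"
    passed: bool                        # final verifier verdict
    task_metadata: dict = field(default_factory=dict)
    observations: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    frame_paths: List[str] = field(default_factory=list)
    failure_category: Optional[str] = None

    def missing_trace_parts(self) -> List[str]:
        """Name the trace components that are absent for this run."""
        missing = []
        for name in ("task_metadata", "observations", "actions", "frame_paths"):
            if not getattr(self, name):
                missing.append(name)
        # A failed run without a failure category cannot be diagnosed.
        if not self.passed and self.failure_category is None:
            missing.append("failure_category")
        return missing


run = RunRecord(
    task_id="example-task",             # placeholder identifier
    model="gemini-3-flash",
    interface="AX-tree",
    passed=False,
    task_metadata={"component": "select native"},
    observations=["<ax-tree snapshot>"],
    actions=["click(option)"],
)
print(run.missing_trace_parts())  # ['frame_paths', 'failure_category']
```

A check like this makes gaps explicit, in the same spirit as marking missing run videos and frame blobs as pending rather than silently omitting them.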

Why component-level evaluation matters

A five-step workflow in which each component interaction succeeds 80% of the time completes only about one third of the time (0.8^5 ≈ 0.33, assuming independent steps). ComponentBench makes the failing primitive inspectable: selection, overlay opening, text editing, table manipulation, drag/drop, precise sliders, date/time input, and more.
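The arithmetic behind the five-step example can be checked in a few lines. This sketch assumes every step has the same reliability and that failures are independent, which is a simplification; the function name is illustrative.

```python
def workflow_success_rate(component_reliability: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent, equally reliable steps."""
    if not 0.0 <= component_reliability <= 1.0:
        raise ValueError("component_reliability must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return component_reliability ** steps


rate = workflow_success_rate(0.8, 5)
print(f"{rate:.3f}")  # 0.328
```

Under the same assumptions, raising per-component reliability to 95% lifts the five-step success rate to roughly 77%, which is why small component-level gains matter for long workflows.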

Root cause isolation

Tasks are grouped by canonical component and family.

Human efficiency baseline

Steps and time establish the minimum useful comparison.

Interface dependence

Same task, different observation/action mode, different failure profile.

Results Grounded In Real Runs

Top v1 model/mode rows from `topline_model_mode.csv`.

Model	Interface	Pass	Avg steps	Avg time	Tasks
gemini-3-flash	Browser-Use	95.2%	4.8	46.2s	2910
gemini-3-flash	AX-tree	89.6%	4.3	48.5s	2910
gemini-3.1-flash-lite	Browser-Use	87.4%	6.0	31.5s	2910
gemini-3-flash	SoM	87.1%	4.7	49.3s	2910
gpt-5-mini	Browser-Use	87.0%	5.8	64.3s	2910
gpt-5.4-mini	Browser-Use	85.8%	5.2	26.1s	2910
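Rows like these can be ranked or filtered with the standard `csv` module. The sketch below embeds the six rows from the table as inline text; the header names (`pass_rate`, `avg_time_s`, and so on) are assumptions, since the actual column names in `topline_model_mode.csv` are not shown here.

```python
import csv
import io

# Rows copied from the table above; the header names are assumed, not confirmed.
SAMPLE_CSV = """model,interface,pass_rate,avg_steps,avg_time_s,tasks
gemini-3-flash,Browser-Use,95.2,4.8,46.2,2910
gemini-3-flash,AX-tree,89.6,4.3,48.5,2910
gemini-3.1-flash-lite,Browser-Use,87.4,6.0,31.5,2910
gemini-3-flash,SoM,87.1,4.7,49.3,2910
gpt-5-mini,Browser-Use,87.0,5.8,64.3,2910
gpt-5.4-mini,Browser-Use,85.8,5.2,26.1,2910
"""


def load_rows(text):
    """Parse CSV text into dicts with numeric columns converted."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({
            "model": raw["model"],
            "interface": raw["interface"],
            "pass_rate": float(raw["pass_rate"]),
            "avg_steps": float(raw["avg_steps"]),
            "avg_time_s": float(raw["avg_time_s"]),
            "tasks": int(raw["tasks"]),
        })
    return rows


def fastest_above(rows, min_pass_rate):
    """Return the lowest-latency row whose pass rate meets the threshold, or None."""
    eligible = [r for r in rows if r["pass_rate"] >= min_pass_rate]
    return min(eligible, key=lambda r: r["avg_time_s"]) if eligible else None


rows = load_rows(SAMPLE_CSV)
best = max(rows, key=lambda r: r["pass_rate"])
quick = fastest_above(rows, 85.0)
print(best["model"], best["interface"], best["pass_rate"])
print(quick["model"], quick["interface"], quick["avg_time_s"])
```

Among these rows, the highest pass rate and the lowest average time belong to different model/mode pairs, so a single ranking column does not capture the trade-off.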

Hard Families

Drag/Drop & Workspace Interactions: 48.4%

Continuous & High-Precision Input: 58.1%

Advanced Editors: 60.1%

Date & Time: 69.9%

Disclosure & Progressive: 71.3%

Mode-Sensitive Components

Component	Avg absolute mode delta
select native	51.0%
kanban board drag drop	49.3%
drag drop sortable list	40.0%
rich text editor	37.5%
drag drop between lists	34.3%
context menu	28.0%
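The page does not define "avg absolute mode delta". One plausible reading, used here purely as an assumption, is the mean absolute difference in pass rate over every pair of interface modes for a given component. The pass rates below are invented for illustration.

```python
from itertools import combinations
from statistics import mean


def avg_absolute_mode_delta(pass_rate_by_mode):
    """Mean absolute pass-rate difference over all mode pairs (assumed definition)."""
    pairs = list(combinations(pass_rate_by_mode.values(), 2))
    if not pairs:
        raise ValueError("need pass rates for at least two modes")
    return mean(abs(a - b) for a, b in pairs)


# Hypothetical pass rates (percent) for one component under three modes.
example = {"Browser-Use": 90.0, "AX-tree": 60.0, "SoM": 30.0}
delta = avg_absolute_mode_delta(example)
print(f"{delta:.1f}")  # 40.0
```

Under this reading, a large delta means the component's difficulty depends heavily on how the agent observes and acts on the page rather than on the component alone.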

InterfaceGym Task Lab Pipeline

Task Lab is presented as a benchmark production pipeline. Some stages are implemented; moderation queues, smoke-test orchestration, and publishing controls are marked as pending backend work.

Stage 1: idea
Stage 2: ontology/component selection
Stage 3: structured task spec
Stage 4: page/component generation
Stage 5: verifier/success checker
Stage 6: human trajectory recording
Stage 7: agent smoke test
Stage 8: moderation/approval
Stage 9: publish benchmark pool
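The nine stages form a strictly ordered pipeline, which can be modelled as an ordered enum. This is an illustrative sketch only: the member names are invented, and mapping "smoke-test orchestration, moderation queues, and publishing controls" onto stages 7 to 9 is an assumption based on the stage titles.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Task Lab pipeline stages in order; member names are hypothetical."""
    IDEA = 1
    COMPONENT_SELECTION = 2
    TASK_SPEC = 3
    PAGE_GENERATION = 4
    VERIFIER = 5
    HUMAN_TRAJECTORY = 6
    AGENT_SMOKE_TEST = 7
    MODERATION = 8
    PUBLISH = 9


# Stages assumed to correspond to the work marked as pending backend integration.
BACKEND_PENDING = {Stage.AGENT_SMOKE_TEST, Stage.MODERATION, Stage.PUBLISH}


def next_stage(current):
    """Return the stage that follows `current`, or None after the final stage."""
    return Stage(current + 1) if current < Stage.PUBLISH else None


def first_pending_stage():
    """Earliest stage that depends on pending backend work."""
    return min(BACKEND_PENDING)


print(next_stage(Stage.VERIFIER).name)
print(first_pending_stage().name)
```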
