MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

Dingbang Wu, Rui Hao, Haiyang Wang, Shuzhe Wu, Han Xiao, Zhenghong Li, Bojiang Zhou, Zheng Ju, Zichen Liu, Lue Fan, Zhaoxiang Zhang

A browser-hosted, lightweight simulation platform for mobile GUI agents that enables deterministic evaluation and scalable online RL.

How can we build a high-fidelity, parallelizable mobile GUI simulation environment that enables large-scale agent training without the overhead of real devices or proprietary backends?

Mobile GUI agent research is currently split between heavyweight emulators that are too slow for large-scale training and real devices that are impossible to reset, parallelize, or safely control. MOBILEGYM replaces these with a browser-hosted, Android-like environment where app state is represented as structured JSON, allowing for deterministic state-based judging, snapshot-based forking, and sandboxing of high-consequence operations. This platform enables hundreds of parallel instances on a single server, with a Sim-to-Real study showing that 95.1% of simulation-side training gains transfer to real-device execution.

Paper Primer

The platform hinges on a layered state model that separates read-only world data from mutable runtime state. By exposing this runtime state as structured JSON, the system allows researchers to programmatically reset, snapshot, and fork the environment, effectively turning mobile interaction into a deterministic, parallelizable task.

Training in MOBILEGYM produces policies that generalize to real-world mobile devices.

A GRPO-trained 4B model evaluated on a 59-task real-device subset retained 95.1% of the performance gain observed in simulation, corresponding to a +40.7 percentage-point increase in real-device success rate.

The platform significantly reduces the infrastructure overhead required for large-scale agent training.

A single server can host hundreds of parallel instances, each requiring ~400 MB of RAM and ~3 seconds for a cold start. This is roughly 1/10th the memory and less than 1/100th the disk footprint of standard emulator-based setups.

Why is this platform better than using standard Android emulators?

Standard emulators are resource-heavy, making large-scale parallel rollouts impractical on commodity hardware. MOBILEGYM’s lightweight, browser-based architecture allows for hundreds of parallel instances on a single machine, while its structured state enables deterministic verification that avoids the unreliability of VLM-based screenshot judging.

Does this platform simulate the full complexity of real-world mobile apps?

No; it models agent-facing interaction semantics rather than proprietary backends or pixel-level Android internals. It focuses on visual screens, touch responses, and task-relevant state transitions, leaving stochastic backend phenomena like live recommendation feeds or server-side policy changes out of scope.

Introduction

Introducing MOBILEGYM: a browser‑hosted simulator that replaces real‑device bottlenecks with structured‑state control.

Current mobile‑GUI research relies on either heavyweight emulators or fragile real‑device setups. Emulators give repeatable evaluation but cannot scale to online reinforcement learning; real devices provide realistic apps but are slow, costly, and nondeterministic. This tension stalls progress on agents that need both fidelity and massive parallel rollouts.

MOBILEGYM is a browser‑hosted Android‑like simulator that swaps the opaque, slow real‑device pipeline for a fast, fully controllable structured‑state engine.

**Figure 1.** Example screens from MobileGym. Annotated launcher and messaging screens showing MobileGym's configurable and sandboxed capabilities.

The key shift is moving from costly real‑device dependence to a lightweight, structured‑state simulation that preserves interaction fidelity while unlocking massive parallel training.

Related Work

Survey of existing mobile GUI agent environments and their limitations.

Prior work on mobile GUI agents has largely relied on heavyweight Android emulators or physical devices, exposing agents to latency and limited parallelism.

An Android emulator mimics a phone’s software stack on a host computer, while a real device runs the same OS on actual hardware; both expose a full UI and system services to the agent.

**Table 1.** Comparison of mobile GUI agent benchmarks and infrastructures. Task-unit labels follow each benchmark's native counting unit. AndroidLab additionally releases 10.5k offline SFT trajectories, not counted here. Validated denotes the real-device transfer study in §5.2, where 95.1% of the simulation-side training gain on the 59-task signal subset is retained. Resource details are in Appendix M.

Beyond mobile, verifiable environments exist for web (WebShop, WebArena), desktop OSes (OSWorld, macOS‑World), and simulated APIs (AppWorld). Recent RL‑based GUI agents (DigiRL, UI‑TARS‑2, UI‑Venus‑1.5) demonstrate the performance gains possible when deterministic, parallel‑friendly platforms are available.

The MOBILEGYM Platform

MOBILEGYM replaces slow device emulators with a browser‑hosted, structured‑state simulation for fast, verifiable training.

The latency of real‑device or emulator loops makes RL on mobile GUIs impractically slow, and the lack of direct state access prevents deterministic reward signals. MOBILEGYM solves this by moving the entire Android‑like stack into a browser and exposing every mutable component as explicit structured state. This design enables parallel rollouts, instant resets, and exact state‑diff judging.

All mutable aspects of the simulated phone—app data, OS settings, and device properties—are stored as a single JSON object that can be read, written, snapshot, and compared.

Merge runtime overlay onto world data → UI shows posts A and B with an empty cart.

Agent taps “Add B to cart” → runtime overlay updates cart to [B].

Snapshot the state → JSON now contains cart: [B] while world data remains unchanged.

Fork a parallel rollout from the snapshot; in the fork the agent removes B from the cart.

Compare the fork’s final state to the original snapshot → diff shows cart changed from [B] back to [] (an unexpected side effect if the task only required adding B).

Because only the runtime overlay is mutable, snapshots are small, and deterministic diffs can isolate exactly which fields the agent altered.
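The walkthrough above can be sketched in a few lines of Python. This is a minimal illustration, not MobileGym's implementation: the field names (`posts`, `cart`) and the helper functions `render_view`, `snapshot`, and `diff` are hypothetical stand-ins for the overlay merge, snapshot, fork, and state-diff steps.

```python
import copy

# Hypothetical data; field names are illustrative, not MobileGym's schema.
WORLD_DATA = {"posts": ["A", "B"]}           # read-only world data


def render_view(world, overlay):
    """Compose the app view by merging the runtime overlay onto world data."""
    return {**world, **overlay}


def snapshot(overlay):
    """A snapshot is a deep copy of the mutable overlay only."""
    return copy.deepcopy(overlay)


def diff(before, after):
    """Return {field: (old, new)} for every top-level field that changed."""
    keys = before.keys() | after.keys()
    return {k: (before.get(k), after.get(k))
            for k in sorted(keys) if before.get(k) != after.get(k)}


overlay = {"cart": []}                       # mutable runtime state
initial = snapshot(overlay)

overlay["cart"].append("B")                  # agent taps "Add B to cart"
after_add = snapshot(overlay)

fork = snapshot(after_add)                   # fork a parallel rollout
fork["cart"].remove("B")                     # the forked agent removes B

print(diff(initial, after_add))              # {'cart': ([], ['B'])}
print(diff(after_add, fork))                 # {'cart': (['B'], [])}
```

Because the snapshot copies only the overlay, the world data is never duplicated, and the diff names exactly the fields that changed.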

How does Structured State differ from a pixel‑level emulator that records screenshots?

Pixel‑level emulators capture only the rendered image, so they cannot tell which underlying data changed; Structured State records the full JSON of mutable fields, enabling exact comparison of before/after values and eliminating ambiguity about side effects.

**Figure 2.** **End-to-end workflow of MOBILEGYM.** A structured state supports task instantiation, parallel rollout forking, and state-diff verification. The resulting judgments are then converted into benchmark metrics and RL rewards.

**Figure 3.** System capabilities and state model of MOBILEGYM. App views are produced by composing read-mostly World Data, a per-environment Runtime Overlay, and the OS Runtime. The resulting structured environment state supports snapshot/reset/fork and deterministic state-diff judging.

MOBILEGYM-BENCH

MOBILEGYM-BENCH defines a parameterized task suite and the AnswerSheet protocol for reliable evaluation.

Existing mobile benchmarks rely on fixed instances and free‑text answer scoring, which leads to memorization and unreliable evaluation. MOBILEGYM‑BENCH replaces those with a richly parameterized suite and a deterministic form‑based answer protocol.

The suite treats each task as a template that is instantiated on‑the‑fly, yielding thousands of distinct episodes without hand‑authoring each one.

Sample an instruction: “Set an alarm for 7 am called Wake‑Up”.

Sample slot values: time = 07:00, label = “Wake‑Up”.

Inject the environment state (empty alarm list).

Instantiate the task: the agent sees the GUI and must fill the “Time” and “Label” fields.

After execution the simulator records a concrete instance “Alarm(07:00, Wake‑Up)”.

Even with a tiny template, the combinatorial sampling creates many unique episodes, forcing the agent to generalize rather than memorize.
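A template of this kind could be instantiated as in the sketch below. The template structure, slot pools, and the `instantiate` helper are assumptions made for illustration; the benchmark's actual template format is not shown in this summary.

```python
import copy
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    instruction: str
    slots: dict
    initial_state: dict


# Hypothetical template: slot pools and wording are illustrative.
ALARM_TEMPLATE = {
    "instruction": "Set an alarm for {time} called {label}",
    "slots": {
        "time": ["06:30", "07:00", "08:15"],
        "label": ["Wake-Up", "Gym", "Meeting"],
    },
    "initial_state": {"alarms": []},         # injected environment state
}


def instantiate(template, seed):
    """Sample one concrete episode from a template, reproducibly per seed."""
    rng = random.Random(seed)
    slots = {name: rng.choice(pool) for name, pool in template["slots"].items()}
    return TaskInstance(
        instruction=template["instruction"].format(**slots),
        slots=slots,
        initial_state=copy.deepcopy(template["initial_state"]),
    )


episode = instantiate(ALARM_TEMPLATE, seed=0)
print(episode.instruction)
```

With three values per slot this toy template already yields nine distinct episodes, and seeding the sampler keeps every episode reproducible.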

Why does parameterizing tasks matter if the underlying actions are the same?

Parameterization changes the concrete values the agent sees (times, contacts, amounts), so the same procedural logic must succeed across a distribution of inputs. This prevents the model from simply memorizing a single fixed sequence.

Instead of free‑text answers, the agent fills a fixed‑format form; each field has a declared type and a deterministic matcher checks the submission.

The agent reads the screen, extracts the temperature “22 °C” and condition “Sunny”.

It fills the Temp field with “22” and selects “Sunny” in the Condition dropdown.

The form is submitted to the matcher.

The numeric matcher checks |22 − 22| ≤ 0 °C tolerance → passes.

The choice matcher verifies “Sunny” matches the expected label → passes.

Both checks succeed, so the task is marked successful.

The typed submission makes the evaluation deterministic; any deviation in format or value is caught immediately.
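The weather example can be expressed as typed field matchers. The sketch below is an assumption about how such matchers might look: `NumberField`, `ChoiceField`, and `judge` are hypothetical names, and the zero tolerance mirrors the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumberField:
    expected: float
    tolerance: float = 0.0

    def matches(self, submitted):
        try:
            value = float(submitted)
        except (TypeError, ValueError):
            return False          # missing or unparsable input fails outright
        return abs(value - self.expected) <= self.tolerance


@dataclass(frozen=True)
class ChoiceField:
    expected: str

    def matches(self, submitted):
        return submitted == self.expected


def judge(sheet, submission):
    """A submission passes only if every declared field matches."""
    return all(field.matches(submission.get(name))
               for name, field in sheet.items())


weather_sheet = {"temp": NumberField(22), "condition": ChoiceField("Sunny")}
print(judge(weather_sheet, {"temp": "22", "condition": "Sunny"}))     # True
print(judge(weather_sheet, {"temp": "22 °C", "condition": "Sunny"}))  # False
```

The second call fails because the numeric field receives text that is not a bare number, which is the kind of format deviation a typed form is meant to catch.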

How does AnswerSheet differ from traditional free‑text scoring?

Free‑text scoring relies on string similarity, which can reject semantically equivalent answers or accept leaked text. AnswerSheet replaces that with exact type checks, so only the intended information matters, not surrounding wording.

**Figure 4.** **AnswerSheet protocol.** Free-text heuristics can reject equivalent answers or accept leaked reasoning that contains the gold answer. AnswerSheet instead uses GUI form filling and type-specific checks over typed fields.

Experimental Results

Key performance numbers and transfer results for the nine agents on MobileGym‑Bench.

Gemini 3.1 Pro attains the highest overall Success Rate (SR) of 58.8 % on MobileGym‑Bench.

Table 2 reports 58.8 ± 1.4 % SR for Gemini 3.1 Pro, surpassing all other evaluated agents.

Training an agent in the structured‑state simulator and then deploying it on a real device tests whether learned policies survive the gap between simulated and actual UI environments.

How does GRPO differ from standard PPO in this context?

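GRPO samples a group of rollouts for the same task, scores each rollout relative to the group’s average reward, and uses that group-relative advantage for the policy update, so no separate value critic is needed. Standard PPO instead estimates advantages with a learned value function, which adds training cost and can be noisy when many short episodes run concurrently, as in the 96-environment batch used here.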
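To make the group-based idea concrete, here is a minimal sketch of how advantages are commonly computed in GRPO: each rollout's reward is normalized by the mean and standard deviation of its own group. The reward values are hypothetical, and whether the paper's training stack uses exactly this normalization is an assumption.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for one group of 8 rollouts of the same task.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# Successful rollouts get a positive advantage, failed ones a negative one.
print([round(a, 2) for a in advantages])
```

A group in which every rollout earns the same reward produces all-zero advantages, so such groups contribute no learning signal.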

**Figure 5.** Sim-to-Real transfer of GRPO training gains. Per-bucket Success Rate on the 59-task signal-bucket subset and the overall Signal Total. In the legend, Sim/Real denotes the evaluation environment and Base/Trained denotes before/after GRPO. Sim columns are 4-seed averages, Real columns are pass@1 and all manually audited (Appendix J).

GRPO training yields large SR improvements in simulation that largely survive on real devices.

Stratification Sensitivity Analysis

MOBILEGYM swaps slow device access for a fast, structured-state simulation, enabling parallel agent training.

We first examine how the L1–L4 difficulty stratification reacts when the set of reference models is halved from eight to four.

Halving the reference set shifts the L1 lift upward by +1.7 pt.

Eight‑model calibration yields +21.3 pt, while the four‑model calibration yields +23.0 pt.

L2 lift drops by ‑2.9 pt when only four reference models are used.

Eight‑model calibration gives +25.4 pt versus +22.5 pt for the four‑model case.

L3 lift declines by ‑3.8 pt under the reduced reference set.

Eight‑model calibration reports +11.1 pt, four‑model calibration reports +7.3 pt.

L4 lift is essentially unchanged (‑0.2 pt) when the reference pool shrinks.

Eight‑model calibration yields +0.9 pt, four‑model calibration yields +0.7 pt.
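The four shifts quoted above follow directly from the reported lifts. A short check, using only the numbers stated in the text:

```python
# (eight-model lift, four-model lift) in percentage points, from the text above.
LIFT = {
    "L1": (21.3, 23.0),
    "L2": (25.4, 22.5),
    "L3": (11.1, 7.3),
    "L4": (0.9, 0.7),
}

# Shift = four-model lift minus eight-model lift, rounded to one decimal.
shift = {level: round(four - eight, 1) for level, (eight, four) in LIFT.items()}
print(shift)  # {'L1': 1.7, 'L2': -2.9, 'L3': -3.8, 'L4': -0.2}
```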

Only the proprietary Gemini 3.1 Pro model stays above the 10 % floor on L4 (12.2 %); all others fall ≤8.1 %.

Table 13 shows Gemini 3.1 Pro at 12.2 % L4 SR, the next highest (UI‑Venus‑1.5‑8B) at 1.9 %.

Next we audit the VLM‑judge’s misjudgment rate on the real‑device trajectories.

**Table 15.** Robustness check: VLM-judge error rate when the same saved real-device trajectories are re-judged with GPT-5.4.

The overall VLM‑judge error rate is 10.2 % on the audited real‑device set.

Table 15 aggregates 12 misjudgments out of 118 trajectories.

Trained agents incur a higher misjudgment rate (11.9 %) than base agents (8.5 %).

Separate rows in Table 15 show 7/59 errors for the trained model versus 5/59 for the base model.
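The reported error rates can be reproduced from the misjudgment counts quoted above; the snippet below is only an arithmetic check on those counts.

```python
# (misjudged, audited) trajectory counts as quoted from Table 15.
COUNTS = {"trained": (7, 59), "base": (5, 59)}


def pct(errors, total):
    """Error rate in percent, rounded to one decimal."""
    return round(100 * errors / total, 1)


rates = {name: pct(e, n) for name, (e, n) in COUNTS.items()}
total_errors = sum(e for e, _ in COUNTS.values())
total_runs = sum(n for _, n in COUNTS.values())
rates["overall"] = pct(total_errors, total_runs)

print(rates)  # {'trained': 11.9, 'base': 8.5, 'overall': 10.2}
```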

We also compare the monetary cost of using different VLM judges.

**Table 16.** Per-run cost comparison if a VLM judge were used

**Table 17.** Cumulative cost at large-scale RL training if a VLM judge were used

Finally, we study how the AnswerSheet protocol influences sim‑to‑real transfer.

With the AnswerSheet in place, the trained model’s sim and real success rates align (+2.6 pt gap).

Sim SR 71.1 % vs. real SR 73.7 % on 19 AnswerSheet tasks.

For the base model the AnswerSheet widens the gap to 22.9 pt (real minus sim, the same sign convention as above), indicating a failure to generalize.

Base model sim SR 2.1 % versus real SR 25.0 % on the same tasks.

Conclusion and Limitations

MOBILEGYM replaces slow, opaque device environments with a browser‑hosted, structured‑state simulator.

By exposing app and OS state as structured data, the platform enables deterministic verification, resettable state, parallel RL, and safe high-risk operations.

The MOBILEGYM‑BENCH suite operationalizes the platform with 416 parameterized task templates, calibrated difficulty strata, and structured AnswerSheet‑based evaluation, including diagnostics for unexpected side effects.

Across nine agents, experiments reveal substantial headroom on everyday mobile tasks, and the Sim‑to‑Real study shows most simulation‑side training gains transfer to real‑device execution.

The same controllable infrastructure can support safety‑alignment research, robustness testing, and training‑data generation, demonstrating that interaction‑fidelity simulation makes everyday mobile tasks reproducible without real accounts or device farms.

Visual appearance modeling remains limited: subtle layout differences, animations, and app‑specific icons are not fully reproduced, so tasks that depend on exact icon recognition may suffer during transfer.

Backend and dynamic‑content modeling focuses on interaction semantics via controllable JSON state, omitting stochastic server phenomena such as live recommendation dynamics, fraud checks, or latency spikes unless explicitly modeled.

Functional coverage is confined to the main everyday‑use scenarios of each app; less common features are out of scope, and expanding coverage is left for future work.

Ethical Considerations

The section outlines legal, safety, misuse, and societal considerations of the sandboxed simulation platform.

MOBILEGYM is a fully sandboxed research infrastructure. All simulation of commercial apps is disconnected from any real service, real account, real funds, or personal data.

The commercial apps reproduced in MOBILEGYM are used only for academic research and model evaluation. Their trademarks, brand names, and visual elements remain the property of their respective owners, and the simulator UI is independently implemented with LLM‑assisted programming, differing from real apps at the pixel level.

The high‑risk subset (Appendix Table 10) comprises 14 tasks: 7 standalone payment operations and 7 high‑consequence tasks from Test256‑Risk. Gemini 3.1 Pro reaches 64.3 % on Payment and 71.4 % on Test256‑Risk, while smaller open‑source GUI specialists remain at ≤10.7 % on Payment.

Trajectory inspection finds no evidence of explicit refusal: frontier models attempt the operation and largely succeed, whereas open‑source models attempt but fail. This report documents execution capability, not endorsement of autonomous operations, and highlights that “execution capability” and “operational caution” are not currently decoupled.

We argue that capability evaluation must be paired with safety alignment. MOBILEGYM’s ability to simulate irreversible operations provides a no‑real‑risk testing infrastructure for follow‑up safety‑alignment research, which is a core part of its value.

Any GUI‑agent training infrastructure could potentially be used to automate malicious behavior. MOBILEGYM is, by design, a research tool for capability evaluation and safety research, not a production deployment, and we encourage its use for defensive research such as safety alignment, prompt‑injection robustness, and refusal training.

The safe‑simulation properties of MOBILEGYM—zero‑consequence operations, one‑click reset, built‑in difficulty levels—make it suitable for digital‑literacy education. Learners can repeatedly practice tasks such as contact lookup, mobile payment, and ticket booking without any real consequences, opening avenues for digital inclusion, customer‑service training, and AI‑safety education.

System Implementation Details

Implementation details omitted from the main text are provided here.

The TaskManager mirrors Android’s ActivityTaskManager: each app runs in its own Task, which maintains an Activity stack. The manager processes requests via a Reducer pattern (`LAUNCH_APP`, `GO_HOME`, `SHOW_RECENTS`, `CLOSE_TASK`, `PUSH_ACTIVITY`, `POP_ACTIVITY`). Backgrounded Activities are kept in the React tree with `display: none`, preserving state across app switches.
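The Reducer pattern can be illustrated with a small pure function that maps a state and an action to a new state. This sketch is in Python for consistency with the other examples here, not the platform's own code; the `Task` and `TaskState` shapes, the payload format, and the rule that a pop never removes the root Activity are all assumptions.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Task:
    app: str
    activities: tuple = ("Main",)      # Activity stack, top is last


@dataclass(frozen=True)
class TaskState:
    tasks: tuple = ()                  # all live tasks, foreground or not
    foreground: Optional[str] = None   # app name, or None for the launcher
    recents_open: bool = False


def task_reducer(state, action):
    """Pure reducer: returns a new TaskState and never mutates the old one."""
    kind, payload = action["type"], action.get("payload")
    if kind == "LAUNCH_APP":
        tasks = state.tasks
        if all(t.app != payload for t in tasks):
            tasks += (Task(app=payload),)          # first launch: new Task
        return replace(state, tasks=tasks, foreground=payload, recents_open=False)
    if kind == "GO_HOME":
        return replace(state, foreground=None, recents_open=False)
    if kind == "SHOW_RECENTS":
        return replace(state, recents_open=True)
    if kind == "CLOSE_TASK":
        tasks = tuple(t for t in state.tasks if t.app != payload)
        fg = None if state.foreground == payload else state.foreground
        return replace(state, tasks=tasks, foreground=fg)
    if kind in ("PUSH_ACTIVITY", "POP_ACTIVITY"):
        def update(t):
            if t.app != state.foreground:
                return t
            if kind == "PUSH_ACTIVITY":
                return replace(t, activities=t.activities + (payload,))
            # Assumption: popping never removes the root Activity.
            return replace(t, activities=t.activities[:-1] or t.activities)
        return replace(state, tasks=tuple(update(t) for t in state.tasks))
    return state


state = TaskState()
for action in [
    {"type": "LAUNCH_APP", "payload": "chat"},
    {"type": "PUSH_ACTIVITY", "payload": "Thread"},
    {"type": "GO_HOME"},
    {"type": "LAUNCH_APP", "payload": "chat"},
]:
    state = task_reducer(state, action)

print(state.tasks[0].activities)       # ('Main', 'Thread')
```

Going home and relaunching leaves the Activity stack intact, which mirrors the state preservation described for backgrounded Activities.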

The state layer follows Android’s hierarchy, persisting user data (OsStateStore.build, settings, hardware managers) across browser refreshes while volatile runtime stores reset. All stores are created by a unified factory and registered globally, enabling a single snapshot or reset call. Managers such as ConnectivityManager cascade dependent state changes automatically.

Cross‑app communication uses an Intent system, a ContentProvider protocol, and a BroadcastBus, mirroring Android’s mechanisms. Intents are resolved by scanning manifests; a unique match triggers a direct transition, while multiple matches invoke a Chooser. ContentProviders expose shared data (Contacts, Sms, Media) with CRUD operations, and broadcasts dispatch system‑level events.
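Intent resolution by manifest scanning can be sketched as follows. The manifest contents, action strings, and the `resolve_intent` helper are hypothetical; only the unique-match versus multiple-match behavior comes from the description above.

```python
# Hypothetical manifests: app names and action strings are illustrative.
MANIFESTS = {
    "browser": {"intent_filters": ["VIEW_URL"]},
    "mail": {"intent_filters": ["SEND_TEXT", "SEND_MAIL"]},
    "messages": {"intent_filters": ["SEND_TEXT"]},
}


def resolve_intent(action, manifests):
    """Scan manifests and decide how an intent should be dispatched."""
    matches = sorted(app for app, manifest in manifests.items()
                     if action in manifest["intent_filters"])
    if not matches:
        return {"kind": "unresolved", "candidates": []}
    if len(matches) == 1:
        return {"kind": "direct", "target": matches[0]}    # unique match
    return {"kind": "chooser", "candidates": matches}      # let the user pick


print(resolve_intent("VIEW_URL", MANIFESTS))
# {'kind': 'direct', 'target': 'browser'}
print(resolve_intent("SEND_TEXT", MANIFESTS))
# {'kind': 'chooser', 'candidates': ['mail', 'messages']}
```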

Back-key handling is implemented by a BackDispatcher that follows a priority chain: permission dialog (1000) > system shade (800) > keyboard (700) > app page (100) > return-to-desktop (0). The first handler returning true consumes the event, and a frame-level back lock prevents duplicate triggers within the same frame.
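A priority chain with a per-frame lock could look like the sketch below. The class shape, the `register` and `dispatch` methods, and the use of an integer frame id are assumptions for illustration; the priorities are the ones listed above.

```python
class BackDispatcher:
    def __init__(self):
        self._handlers = []            # (priority, name, callback)
        self._locked_frame = None

    def register(self, priority, name, callback):
        self._handlers.append((priority, name, callback))
        self._handlers.sort(key=lambda h: -h[0])   # highest priority first

    def dispatch(self, frame_id):
        """Return the name of the handler that consumed the event, or None."""
        if frame_id == self._locked_frame:
            return None                # back lock: already handled this frame
        self._locked_frame = frame_id
        for _, name, callback in self._handlers:
            if callback():             # first handler returning True wins
                return name
        return None


ui = {"keyboard_open": True}


def close_keyboard():
    if ui["keyboard_open"]:
        ui["keyboard_open"] = False
        return True
    return False


dispatcher = BackDispatcher()
dispatcher.register(100, "app_page", lambda: True)
dispatcher.register(700, "keyboard", close_keyboard)
dispatcher.register(0, "return_to_desktop", lambda: True)

first = dispatcher.dispatch(frame_id=1)    # keyboard is open, so it consumes
repeat = dispatcher.dispatch(frame_id=1)   # same frame: blocked by the lock
second = dispatcher.dispatch(frame_id=2)   # keyboard closed, app page handles
print(first, repeat, second)               # keyboard None app_page
```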

Every app follows a fixed directory layout (`manifest.ts`, `*App.tsx`, `state.ts`, `navigation.declaration.ts`, `data/defaults.json`) and is discovered automatically via Vite’s `import.meta.glob`. The manifest enables zero-registration auto-discovery, while the app code uses a MemoryRouter and a Zustand store for state management.

Agent actions arrive in screenshot pixel coordinates. These are transformed to CSS viewport coordinates, mapped to DOM elements via document.elementFromPoint, and finally injected into React as PointerEvent/TouchEvent sequences. When an app declares a designViewportWidth, an inverse transform compensates for CSS zoom.
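The coordinate pipeline can be sketched as two small transforms. Both are assumptions about the geometry rather than the platform's code: the first divides screenshot pixels by a device pixel ratio, and the second treats CSS zoom as the ratio of the actual viewport width to the declared design width and undoes it.

```python
def to_css_point(x_px, y_px, device_pixel_ratio=1.0):
    """Screenshot pixels -> CSS viewport coordinates."""
    return x_px / device_pixel_ratio, y_px / device_pixel_ratio


def to_design_point(x_css, y_css, viewport_width, design_viewport_width=None):
    """Undo CSS zoom for apps laid out against a fixed design width."""
    if design_viewport_width is None:
        return x_css, y_css            # app declares no design width
    zoom = viewport_width / design_viewport_width
    return x_css / zoom, y_css / zoom


# Hypothetical device: 1080x2400 screenshot at a device pixel ratio of 3,
# i.e. a 360x800 CSS viewport; the app was designed for a 720-wide layout.
css = to_css_point(540, 1200, device_pixel_ratio=3.0)
design = to_design_point(*css, viewport_width=360, design_viewport_width=720)
print(css, design)                     # (180.0, 400.0) (360.0, 800.0)
```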

Using the standardized architecture and Vite hot‑module replacement, developers built 28 apps in roughly 60 person‑days; typical apps required 3–4 person‑days, while system apps took under one day each. These estimates exclude benchmark authoring and real‑device auditing.

The UI navigation of each app is modeled as an extended finite-state machine (EFSM) $M = (S, \Sigma, \Delta, s_0, D, G, U)$, where $S$ is the set of UI states, $\Sigma$ the set of input actions, $\Delta$ the transition function with guards $G$ and updates $U$, $s_0 \in S$ the initial state, and $D$ the set of application variables. Compared to a classic FSM, guards allow the same input to trigger different transitions based on data, and updates can expand the state space dynamically.

Guard specifications include URL‑based constraints (e.g., modal must be absent) and data‑driven conditions (e.g., isFollowing = false) that select navigation targets. The declarative navigation spec drives both runtime transitions via go(...) and static analysis such as BFS path enumeration.
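The sketch below shows how one declarative transition table can serve both purposes: guarded runtime transitions and breadth-first path enumeration. The states, actions, and the Python `go` function are hypothetical stand-ins; only the `isFollowing` guard is taken from the example above.

```python
from collections import deque

# Hypothetical navigation spec: (source, action, guard, target).
# Guards read the application variables; None means "always enabled".
TRANSITIONS = [
    ("feed", "open_profile", None, "profile"),
    ("profile", "tap_follow", lambda d: not d["isFollowing"], "follow_confirm"),
    ("profile", "tap_follow", lambda d: d["isFollowing"], "unfollow_dialog"),
    ("follow_confirm", "back", None, "profile"),
]


def go(state, action, data):
    """Runtime transition: take the first enabled edge for (state, action)."""
    for src, act, guard, dst in TRANSITIONS:
        if src == state and act == action and (guard is None or guard(data)):
            return dst
    return state                       # no enabled transition: stay put


def shortest_path(start, goal):
    """Static analysis: BFS over the transition graph, ignoring guards."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for src, act, _, dst in TRANSITIONS:
            if src == state and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [act]))
    return None


print(go("profile", "tap_follow", {"isFollowing": False}))  # follow_confirm
print(go("profile", "tap_follow", {"isFollowing": True}))   # unfollow_dialog
print(shortest_path("feed", "unfollow_dialog"))  # ['open_profile', 'tap_follow']
```

The same `tap_follow` input leads to different screens depending on data, which is the property that distinguishes an extended FSM from a classic one.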

At runtime, navigation is invoked through a hook (`const { go, back } = useAppNavigate()`) and reflected in DOM elements via `data-trigger` attributes, enabling the agent to trigger UI actions directly. Although the agent does not read these attributes, they provide a clear mapping from logical transitions to concrete UI elements.


Action Space Details

Appendix details the action space, app coverage, task taxonomy, and AnswerSheet protocol.

The appendix enumerates the full MOBILEGYM action set, the synthetic app suite, example tasks, and the structured AnswerSheet evaluation form, followed by the inference configuration used in experiments.

Inference uses temperature 0.1, `top_p` 0.95, max 4096 tokens, a 0.8 s UI‑stabilization wait after each action, and loop detection after ten identical actions; screenshots are normalized to a 0–1000 coordinate grid.
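Loop detection of this kind can be a simple check on the tail of the action history. The sketch assumes that "ten identical actions" means ten consecutive identical actions and that actions are comparable values; the action encoding shown is hypothetical.

```python
def is_looping(actions, limit=10):
    """True once the last `limit` actions are all identical."""
    if len(actions) < limit:
        return False
    tail = actions[-limit:]
    return all(action == tail[0] for action in tail)


# Hypothetical action encoding: (kind, x, y) on the 0-1000 grid.
history = [("tap", 500, 250)] * 9
print(is_looping(history))                 # False: only nine repeats so far
history.append(("tap", 500, 250))
print(is_looping(history))                 # True: tenth identical action
```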

Training Configuration

Appendix G–I records training settings, reward shaping, model roster, and extended evaluation tables.

GRPO training runs start from the Qwen3‑VL‑4B‑Instruct policy, execute on three GPUs with 96 parallel environments, and use vLLM asynchronous rollouts (group size k = 8). The PPO pipeline processes batches of 12, with a per‑GPU micro‑batch of 2, while the optimizer runs at 1e‑6 learning rate, gradient clipping 1.0, and KL coefficient 0.01. Prompt length is capped at 32768 tokens, response at 1024, rollout length at 40960, and decoding temperature is 0.7 (validation 0.1); each environment action incurs a 0.8 s delay.

The reward is derived from structured rollout artifacts: let p ∈ [0,1] be task progress, and set the base reward r = p. For AnswerSheet tasks, if the sheet is submitted with any wrong answer, the bookkeeping field is removed and progress is recomputed as p′, preventing spurious credit. The reward is then multiplied by a penalty coefficient for each of four binary conditions that holds: goal ∧ ¬clean (0.8), false complete ∧ p′ > 0 (0.8), post-success abort (0.5), and overdue (0.5). This shapes the signal toward clean, complete successes.
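One way to read this shaping rule is as a product of penalty coefficients applied to the progress value. The sketch below follows that reading, which is an interpretation rather than the paper's code: the flag names are hypothetical, and progress is assumed to have already been recomputed for wrong AnswerSheet submissions.

```python
# Coefficients as listed above; flag names are hypothetical labels.
PENALTIES = {
    "goal_not_clean": 0.8,         # goal and not clean
    "false_complete": 0.8,         # false complete with positive progress
    "post_success_abort": 0.5,
    "overdue": 0.5,
}


def shaped_reward(progress, flags):
    """Base reward r = p, scaled by the coefficient of every condition that holds."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    reward = progress
    for name, coefficient in PENALTIES.items():
        if flags.get(name, False):
            reward *= coefficient
    return reward


print(shaped_reward(1.0, {}))                            # 1.0
print(shaped_reward(1.0, {"goal_not_clean": True}))      # 0.8
```

Under this reading penalties compound, so an episode that triggers several conditions is discounted by the product of their coefficients.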

The appendix evaluates a diverse set of recent agents: Gemini 3.1 Pro (Google DeepMind), Doubao‑Seed‑2.0‑Pro (ByteDance), Qwen3.6‑Plus (Alibaba), AutoGLM‑Phone‑9B (Zhipu AI), UI‑TARS‑1.5‑8B (ByteDance), UI‑Venus‑1.5‑8B (Venus‑Team), GUI‑Owl‑1.5‑8B‑Think (Mobile‑Agent‑v3.5), Step‑GUI‑4B (Yan et al.), and the base Qwen3‑VL‑4B‑Instruct model.

Table 8 (H.1) breaks down test‑set Success Rate across the three taxonomy axes—Difficulty, Objective, and Composition—extending the difficulty‑only view shown earlier in Table 2.

Table 9 (H.2) reports mean episode length (all steps) and mean length of successful trajectories; successful runs cluster between 8 and 14 steps, while overall episodes vary widely, reflecting frequent early terminations by weaker agents.

The High‑Risk subset (H.3) comprises 14 tasks—seven standalone payment operations and seven high‑risk tasks drawn from test256—chosen to probe execution in irreversible scenarios; it is not a safety test of refusal behavior.

Section H.4 provides a difficulty‑wise breakdown of simulation‑side performance, showing how success rates vary from easy to hard tasks within the Sim‑to‑Real experiments.

Table H.5 stratifies tasks by outcome buckets: Uplift (26 tasks), Stable‑pass (21), Mid (20), Stable‑fail (189). From these, all 67 signal‑bucket tasks are kept, and 15 Stable‑fail tasks are sampled as sanity checks, yielding 74 real‑device evaluations after discarding 8 unreproducible cases.

Table 12 (H.6) compares trajectory lengths on matched real-device and simulator outcomes. Successful runs differ by only +2.12 steps (trained) or +1.03 steps (base). Query successes are shorter on real devices because the AnswerSheet steps are omitted there, and the failure rows reflect differing early-stop policies rather than evidence about length.


Future Research Directions

Appendix M–N outlines new research avenues enabled by MOBILEGYM.

Because apps, world data, task templates, and judges are modular, researchers can instantiate domain‑specific environments such as mobile finance, travel planning, social‑media safety, or digital‑literacy training while keeping the same reset, snapshot, and state‑based judging interface.

Programmable state and event injection lets agents be evaluated under systematic variations—different permission sets, network conditions, incoming messages, pop‑ups, or phishing‑like content—supporting controlled studies of robustness, prompt‑injection susceptibility, caution gating, side effects, and recovery behavior.

MOBILEGYM can fork identical initial states into many lightweight browser instances, providing a practical testbed for online RL in GUI environments without large emulator clusters or real‑device farms. Researchers can compare reward designs, state‑diff penalties, rollout grouping strategies, and Sim‑to‑Real behavior under reproducible initial states and deterministic outcome signals.

Each interaction step yields a five-tuple $(s^{\text{vis}}_t, s^{\text{json}}_t, a_t, s^{\text{vis}}_{t+1}, s^{\text{json}}_{t+1})$ of paired visual and structured-state transitions, enabling intentional state coverage for training mobile UI world models, state predictors, reward models, or trajectory verifiers.
