RealWonder: Real-Time
Physical Action-Conditioned Video Generation

Wei Liu1*     Ziyu Chen1*     Zizhang Li1    
Yue Wang2     Hong-Xing (Koven) Yu1†     Jiajun Wu1†    
1Stanford University    2University of Southern California
*Equal contribution    †Equal advising
Future video prediction conditioned on physical actions
13.2 FPS running on a single H200
Lamp Cloth Tree

* Videos are accelerated by 2× to better highlight the dynamic effects.

Long Stream Action-Conditioned Video Generation

We show examples of RealWonder's streaming video generation results conditioned on a sequence of 3D physical actions. For robot gripper actions, we use a Franka robot model (the white robot) to convey the action sequence and do not add further visualizations. For 3D forces and 3D wind fields, we use blue arrows and wind icons to visualize the action sequence.

More Examples

Here are more examples showcasing RealWonder's capabilities on diverse scenes and materials. For 3D forces and 3D wind fields, we use blue arrows and wind icons to visualize the action sequence.

Comparisons with Baselines

Side-by-side comparisons of RealWonder (left) with baseline methods (right). Select an example below and choose a comparison method. Since the baseline video models (CogVideoX and TORA) cannot generate long videos, we freeze them after their generation windows.

Abstract

Current video generation models cannot simulate the physical consequences of 3D actions such as forces and robotic manipulations, as they lack a structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is to use physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from a single image, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480×832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder opens new opportunities to apply video models to motion planning, AR/VR, and robot learning.
RealWonder Teaser

Method Overview

(Left) Given a single image and a sequence of actions as input, we first reconstruct the 3D scene as point clouds; (middle) we estimate materials for the objects to be interacted with, and maintain a physics simulation stream driven by the actions. (Right) Meanwhile, we maintain another stream that renders optical flow and an RGB preview to condition a few-step video generator, producing the physical action-conditioned video stream.
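The per-frame loop above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: every function name (`simulate_step`, `render_conditioning`, `generate_frame`, `stream_video`) and all data structures are hypothetical stand-ins for the three components (physics simulation, conditioning renderer, distilled few-step video generator).

```python
# Hypothetical sketch of an action-conditioned streaming loop.
# All names and structures below are illustrative assumptions,
# not RealWonder's actual API.

def simulate_step(state, action):
    """Placeholder physics step: in the real system, a simulator
    advances the reconstructed point cloud under the given action."""
    return {"points": state["points"], "t": state["t"] + 1, "action": action}

def render_conditioning(state):
    """Placeholder renderer: produces the optical-flow and RGB-preview
    conditioning signals from the current simulation state."""
    return {"flow": f"flow@{state['t']}", "rgb": f"rgb@{state['t']}"}

def generate_frame(cond, num_steps=4):
    """Placeholder for the distilled video generator, which refines the
    conditioning over only a few (here 4) diffusion steps."""
    frame = cond["rgb"]
    for _ in range(num_steps):
        pass  # one denoising step in the real model
    return frame

def stream_video(initial_state, actions):
    """Interleave simulation and generation, yielding one frame per action."""
    state = dict(initial_state)
    for action in actions:
        state = simulate_step(state, action)      # physics stream
        cond = render_conditioning(state)         # flow + RGB preview stream
        yield generate_frame(cond)                # few-step generator

frames = list(stream_video({"points": [], "t": 0}, ["push", "wind", "grasp"]))
```

Because simulation and generation run as interleaved streams, the loop produces frames continuously rather than in fixed-length clips, which is what enables long-horizon, interactive generation.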
Research Poster

BibTeX

@misc{realwonder2026,
  title={RealWonder: Real-Time Physical Action-Conditioned Video Generation},
  author={Liu, Wei and Chen, Ziyu and Li, Zizhang and Wang, Yue and Yu, Hong-Xing and Wu, Jiajun},
  year={2026},
  eprint={2603.05449},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.05449},
}