Data-driven learning approaches for physics simulation, sometimes referred to as world models, have emerged as promising alternatives to traditional physics simulators due to their differentiable nature. Prior work has demonstrated impressive results in predicting the motions of rigid and non-rigid objects in complex scenes involving multiple interacting bodies. However, these models are typically trained in simulated environments because obtaining perfect state information, such as complete scene point clouds and point correspondences over time, is challenging in real-world settings. This reliance on synthetic data can limit their applicability when the sim-to-real gap is large. In this work, we aim to overcome these limitations by introducing a novel framework for training neural object dynamics models directly from unlabeled real-world videos. Specifically, we propose to learn a particle-based dynamics model compatible with a Gaussian splatting framework, which operates on dense particles derived from Gaussians (i.e., particles with scales and rotations) and predicts their position and rotation changes over time. The model is trained via rendering supervision, enabling learning from real-world videos without requiring particle-level state labels. Our model operates directly on dense Gaussians without relying on heuristically subsampled anchor points. To enable this study, we also present a real-world dataset consisting of about 500 videos capturing diverse object interactions.
In this work, we propose a training pipeline for learning a particle dynamics model from real-world videos. Our pipeline uses off-the-shelf methods for depth estimation, point tracking, instance segmentation, and pose fitting to extract pseudo 6-DoF poses for objects of interest from videos. We then use these pseudo labels, together with the original images, to train particle dynamics models with both position and rendering supervision.
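To make the pose-fitting step concrete, here is a minimal sketch of how a pseudo 6-DoF pose could be recovered from tracked 3D points of one object in two frames. It assumes the standard Kabsch least-squares alignment; the pipeline's actual off-the-shelf pose-fitting method is not specified here, and the function name `fit_rigid_pose` is hypothetical.

```python
import numpy as np


def fit_rigid_pose(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t.

    Assumption: Kabsch alignment stands in for the pose-fitting step.
    src, dst: (N, 3) arrays of corresponding 3D points (N >= 3, not collinear).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)

    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis when needed so R is a proper rotation (det = +1),
    # not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t
```

Applying the fitted `(R, t)` to an object's particles gives per-particle target positions, which is one plausible way such pseudo labels could feed the position supervision term alongside the rendering loss.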
The dynamics model represents objects as dense particles derived from renderable 3D Gaussians and predicts how those particles move over time. The network follows a point cloud U-Net structure over point features: pointwise MLP layers convert per-point inputs, including the velocity and z-coordinate at times t and t-1, into initial point features; interaction blocks progressively move between dense, 2 cm, and 5 cm resolutions; and the decoder restores dense predictions for each particle.
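The per-point inputs and the coarser resolutions can be illustrated with a small sketch. It assumes that velocity is the per-frame displacement and that the 2 cm and 5 cm levels come from voxel-grid averaging; neither detail is stated above, and the helper names `point_inputs` and `voxel_downsample` are hypothetical. Coordinates are in meters, and object IDs are ignored for brevity.

```python
import math
from collections import defaultdict


def point_inputs(pos_prev, pos_curr):
    """Per-point input [vx, vy, vz, z_t, z_{t-1}].

    Assumption: velocity is the displacement between frames t-1 and t.
    """
    feats = []
    for p0, p1 in zip(pos_prev, pos_curr):
        velocity = [b - a for a, b in zip(p0, p1)]
        feats.append(velocity + [p1[2], p0[2]])
    return feats


def voxel_downsample(points, voxel_size):
    """Average all points that fall into the same cubic voxel.

    Returns (centroids, assignment), where assignment[i] is the index of the
    coarse point that dense point i was merged into. The assignment is what a
    decoder would need to copy coarse features back to dense particles.
    """
    buckets = defaultdict(list)
    for i, p in enumerate(points):
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets[key].append(i)

    centroids = []
    assignment = [0] * len(points)
    for j, idxs in enumerate(buckets.values()):
        centroids.append(
            [sum(points[i][k] for i in idxs) / len(idxs) for k in range(3)]
        )
        for i in idxs:
            assignment[i] = j
    return centroids, assignment
```

Calling `voxel_downsample(points, 0.02)` and then `voxel_downsample(points, 0.05)` would give the two coarser levels of the hierarchy under these assumptions.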
Each interaction block combines two complementary operations. Object PointConv operates between points belonging to the same object and optionally changes the point resolution, while Relational PointConv exchanges information between points of neighboring objects. Together, these operations allow the model to capture force propagation from collisions both within and across objects.
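The distinction between the two operations comes down to which neighbors each point aggregates from. The sketch below assumes radius-based neighborhoods and uses a brute-force search; the function name `split_neighbors` and the radius value are hypothetical, and the actual convolution weights are omitted.

```python
import math


def split_neighbors(points, object_ids, radius):
    """Split each point's neighbors by object membership.

    Returns (intra, inter): intra[i] lists neighbors of point i within
    `radius` that share its object ID (the set an Object PointConv would
    aggregate over), and inter[i] lists neighbors from other objects (the
    set a Relational PointConv would aggregate over).
    Assumption: fixed-radius neighborhoods; O(N^2) search for clarity only.
    """
    n = len(points)
    intra = [[] for _ in range(n)]
    inter = [[] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if math.dist(points[i], points[j]) <= radius:
                if object_ids[i] == object_ids[j]:
                    intra[i].append(j)
                else:
                    inter[i].append(j)
    return intra, inter
```

Points with an empty `inter` list are not in contact with another object at that radius, so in this sketch only the within-object operation would update them.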
Point color in the above figure indicates the object ID to which each point belongs.
@InProceedings{Kim_2026_CVPRF,
title = {Learning a Particle Dynamics Model with Real-world Videos},
author = {Kim, Chanho and Sumukh, Suhas V. and Fuxin, Li},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
year = {2026}}