Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting
A unified framework that processes streaming 2D frame sequences of a video to estimate the past, present, and future of the underlying dynamic scene, in 3D.
MERL Researchers: Moitreya Chatterjee, Pedro Miraldo, Suhas Lohit.
Joint work with:
Xinhang Liu (The Hong Kong University of Science and Technology)
Huaizu Jiang (Northeastern University)
Naoko Sawada (Mitsubishi Electric Company)
Yu-Wing Tai (Dartmouth College)
Chi-Keung Tang (The Hong Kong University of Science and Technology)
Understanding how the 3D world evolves over time is a fundamental task in computer vision, essential for embodied settings, autonomous driving, etc. It requires not only the reconstruction of the observed scene but also the anticipation of how the scene dynamics will unfold in the future. While the area of 3D reconstruction has progressed rapidly with the advent of recent feed-forward neural networks, forecasting future dynamics in 3D given the 2D frames of a video remains unexplored. Towards this end, we introduce Point4Cast, a unified framework that processes streaming 2D frame sequences of a video to estimate the past, present, and future of the underlying dynamic scene, in 3D. At the core of our approach lies a persistently evolving latent spacetime representation that models the scene's evolution across time. The approach works by undertaking two operations: (i) Update and (ii) Readout. Upon receiving a new 2D frame, the "update" operation integrates the incoming evidence to refine the latent spacetime representation. When queried for any time instant, whether before, at, or beyond the timestamp of the last update, the "readout" procedure predicts temporally conditioned point maps and camera parameters describing the scene geometry at the queried time. Unlike prior approaches for online dynamic scene reconstruction that estimate each frame's point map solely at the timestamp of the last observed frame, Point4Cast achieves coherent reconstruction across any queried time. Empirical evaluations show that Point4Cast achieves state-of-the-art performance on streaming dynamic scene reconstruction and forecasting benchmarks, across multiple challenging datasets, while providing scene flow estimation and forecasting without the need for any additional inference or training.
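The update/readout calling pattern described above can be illustrated with a toy sketch. This is not the Point4Cast model: the class `ToySpacetimeState`, its fields, and the constant-velocity rule are all hypothetical stand-ins for the learned latent representation, and the toy receives 3D points directly, whereas the actual framework receives 2D frames and predicts point maps and camera parameters with a neural network. The sketch only shows how a single persistent state can be updated as observations stream in and then queried at past, present, or future times.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class ToySpacetimeState:
    """Hypothetical stand-in for the learned latent spacetime representation."""

    last_time: Optional[float] = None
    positions: List[Point] = field(default_factory=list)
    velocities: List[Point] = field(default_factory=list)

    def update(self, t: float, points: List[Point]) -> None:
        """Fold a new observation taken at time t into the persistent state."""
        if self.last_time is None:
            # First observation: no motion evidence yet, so assume a static scene.
            self.velocities = [(0.0, 0.0, 0.0) for _ in points]
        else:
            dt = t - self.last_time
            if dt <= 0:
                raise ValueError("observations must arrive in increasing time order")
            if len(points) != len(self.positions):
                raise ValueError("toy model assumes a fixed set of points")
            # Assumption: constant velocity between consecutive observations.
            self.velocities = [
                ((n[0] - o[0]) / dt, (n[1] - o[1]) / dt, (n[2] - o[2]) / dt)
                for n, o in zip(points, self.positions)
            ]
        self.positions = list(points)
        self.last_time = t

    def readout(self, t_query: float) -> List[Point]:
        """Return the estimated points at any time: before, at, or after the last update."""
        if self.last_time is None:
            raise RuntimeError("readout called before any update")
        dt = t_query - self.last_time
        return [
            (p[0] + v[0] * dt, p[1] + v[1] * dt, p[2] + v[2] * dt)
            for p, v in zip(self.positions, self.velocities)
        ]


state = ToySpacetimeState()
state.update(0.0, [(0.0, 0.0, 0.0)])  # first observation
state.update(1.0, [(1.0, 0.0, 0.0)])  # second observation: the point moved along x

past = state.readout(0.5)     # before the last update
present = state.readout(1.0)  # at the last update
future = state.readout(2.0)   # beyond the last update (forecast)
```

The point of the sketch is the interface: `update` never needs the full frame history, and `readout` takes an arbitrary query time, so reconstruction and forecasting share one code path.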
Details of the model:
At the core of our method, Point4Cast, lies a persistently evolving latent spacetime representation that is trained to model the scene's structure and dynamics across the past, present, and anticipated future. As new frames arrive, an update operation integrates the incoming observations into this latent representation, progressively building a consistent model of the scene over time, as shown in the figure above (left). When queried with an image and any time instant, Point4Cast performs a readout operation that yields the scene geometry and camera parameters at the queried time, as shown in the figure above (right). This design enables temporally coherent reconstruction across the observed time span and plausible forecasting of future scene geometry, unifying 3D reconstruction and forecasting in a single framework. Moreover, the point maps that Point4Cast reconstructs at different time steps are aligned to the same coordinate system, so motion tracks of specific points in the scene can be established without any additional inference or training.
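To see why aligned point maps yield motion tracks without extra inference, consider the following minimal sketch. It assumes that, for a fixed query image, each pixel of the point map read out at time t holds the 3D position of that pixel's scene point at time t, expressed in one shared coordinate system. Under that assumption, scene flow is a per-pixel difference and a motion track is a per-pixel lookup across time. The function names `scene_flow` and `track_pixel` and the nested-list point map layout are illustrative choices, not part of Point4Cast.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
PointMap = List[List[Point]]  # H x W grid of 3D points in a shared coordinate system


def scene_flow(pm_start: PointMap, pm_end: PointMap) -> PointMap:
    """Per-pixel 3D displacement between two aligned point maps."""
    if len(pm_start) != len(pm_end) or any(
        len(ra) != len(rb) for ra, rb in zip(pm_start, pm_end)
    ):
        raise ValueError("point maps must have the same shape")
    return [
        [(b[0] - a[0], b[1] - a[1], b[2] - a[2]) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(pm_start, pm_end)
    ]


def track_pixel(point_maps: List[PointMap], row: int, col: int) -> List[Point]:
    """Motion track of one pixel's scene point across a sequence of point maps."""
    return [pm[row][col] for pm in point_maps]


# Hypothetical 1 x 2 point maps read out at two query times for the same image.
pm_t0: PointMap = [[(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]]
pm_t1: PointMap = [[(0.5, 0.0, 1.0), (1.0, 0.0, 1.0)]]

flow = scene_flow(pm_t0, pm_t1)             # first point moves along x, second is static
track = track_pixel([pm_t0, pm_t1], 0, 0)   # positions of pixel (0, 0) over time
```

Because one of the query times may lie beyond the last observed frame, the same subtraction would give a scene flow forecast rather than an estimate, with no change to the code.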