From a video of a human, our model (blue) can predict 3D meshes that are more temporally consistent
than a method that only uses a single view (pink).
From a single image (purple), our model can recover the current 3D mesh as well as the past
and future 3D poses.
From an image of a person in action, we can easily guess the 3D motion of the
person in the immediate past and future. This is because we have a mental model of 3D
human dynamics that we have acquired from observing visual sequences of humans
in motion. We present a framework that can similarly learn a representation of
3D dynamics of humans from video via a simple but effective temporal encoding of
image features. At test time, from video, the learned temporal representation can recover
smooth 3D mesh predictions. From a single image, our
model can recover the current 3D mesh as well as its past and future 3D
motion. Our approach is designed so it can learn from videos with 2D pose
annotations in a semi-supervised manner. However, annotated data is always limited.
On the other hand, millions of videos are uploaded to the Internet every day. In this work, we harvest
this Internet-scale source of unlabeled data by training our model on these videos with
pseudo-ground-truth 2D poses obtained from an off-the-shelf 2D pose detector. Our experiments
show that adding more videos with pseudo-ground-truth 2D poses monotonically improves 3D
prediction performance.
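As an illustration of this semi-supervised setup, the sketch below shows one way pseudo-ground-truth 2D keypoints from an off-the-shelf detector (e.g., OpenPose) could supervise the projected joints of the predicted meshes. The confidence-weighted L1 loss, the tensor shapes, and the use of PyTorch are assumptions for illustration, not the exact formulation used here.

import torch

def pseudo_gt_keypoint_loss(pred_kp2d, detected_kp2d, confidence):
    """Confidence-weighted L1 loss between the projected 2D joints of the
    predicted 3D mesh and pseudo-ground-truth detections from a 2D pose detector.
    pred_kp2d, detected_kp2d: (batch, time, joints, 2); confidence: (batch, time, joints).
    Low-confidence detections contribute less, which makes noisy pseudo labels usable."""
    per_joint_error = (pred_kp2d - detected_kp2d).abs().sum(dim=-1)  # (batch, time, joints)
    return (confidence * per_joint_error).mean()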
We evaluate our model on the recent, challenging 3D Poses in the Wild dataset and
obtain state-of-the-art performance on the 3D prediction task without any fine-tuning.
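As a minimal sketch of what the "simple but effective temporal encoding of image features" mentioned above could look like, the snippet below runs a small stack of 1D temporal convolutions over per-frame image features. The layer sizes, kernel width, and use of PyTorch are illustrative assumptions rather than the exact architecture.

import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Encodes a sequence of per-frame image features into a temporal
    representation via 1D convolutions over time."""
    def __init__(self, feat_dim=2048, hidden_dim=1024, kernel_size=3, num_layers=3):
        super().__init__()
        layers, in_ch = [], feat_dim
        for _ in range(num_layers):
            layers += [nn.Conv1d(in_ch, hidden_dim, kernel_size, padding=kernel_size // 2),
                       nn.ReLU(inplace=True)]
            in_ch = hidden_dim
        self.net = nn.Sequential(*layers)

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        x = feats.transpose(1, 2)             # Conv1d expects (batch, channels, time)
        return self.net(x).transpose(1, 2)    # (batch, time, hidden_dim)

Each time step of this temporal representation would then be decoded into the current 3D mesh (and, for a single image, the hallucinated past and future poses) by regression heads that are not shown here.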
We thank David Fouhey for providing us with the people subset of
VLOG, Rishabh Dabral for providing the source code for TP-Net,
Timo von Marcard and Gerard Pons-Moll for help with
3DPW, and Heather Lockwood for her help and support. This work was supported in part by Intel/NSF VEC award
IIS-1539099 and BAIR sponsors. This webpage template was borrowed from some colorful folks.