End-to-end Recovery of Human Shape and Pose

Angjoo Kanazawa
Michael J. Black
David W. Jacobs
Jitendra Malik

University of California, Berkeley
MPI for Intelligent Systems, Tübingen, Germany
University of Maryland, College Park

Human Mesh Recovery (HMR): End-to-end adversarial learning of human pose and shape. We present a real-time framework for recovering the 3D joint angles and shape of the body from a single RGB image. The bottom row shows results from a model trained without any paired 2D-to-3D supervision. We infer the full 3D body even in cases of occlusion and truncation. Note that we capture head and limb orientations.

We present Human Mesh Recovery (HMR), an end-to-end framework for reconstructing a full 3D mesh of a human body from a single RGB image. In contrast to most current methods that compute 2D or 3D joint locations, we produce a richer and more useful mesh representation that is parameterized by shape and 3D joint angles. The main objective is to minimize the reprojection loss of keypoints, which allows our model to be trained using \emph{in-the-wild} images that only have ground-truth 2D annotations. However, the reprojection loss alone is highly underconstrained. We address this problem by introducing an adversary, trained on a large database of 3D human meshes, that judges whether body shape and pose parameters are real or not. We show that HMR can be trained with and without paired 2D-to-3D supervision. We do not rely on intermediate 2D keypoint detection and infer 3D pose and shape parameters directly from image pixels. Our model runs in real time given a bounding box containing the person. We demonstrate our approach on various in-the-wild images, outperform previous optimization-based methods that output 3D meshes, and show competitive results on tasks such as 3D joint location estimation and part segmentation.


Angjoo Kanazawa, Michael J. Black, David W. Jacobs, Jitendra Malik.

End-to-end Recovery of Human Shape and Pose

arXiv, Dec 2017.




Overview of the proposed framework. An image is passed through a convolutional encoder and then to an iterative 3D regression module that infers the latent 3D representation of the human that minimizes the joint reprojection error. The 3D parameters are also sent to the discriminator D, whose goal is to tell whether the 3D human comes from real data or not.
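The iterative regression module described above can be sketched as follows: rather than predicting the parameters in one shot, the regressor repeatedly predicts a correction to the current estimate, conditioned on the image features. This is an illustrative NumPy sketch under those assumptions, not the authors' implementation; `iterative_regression` and the toy regressor are hypothetical names.

```python
import numpy as np

def iterative_regression(features, regressor, theta_init, n_iter=3):
    """Iterative error feedback (sketch): at each step the regressor
    sees the image features and the current parameter estimate, and
    predicts a correction (delta) to that estimate."""
    theta = theta_init
    for _ in range(n_iter):
        delta = regressor(np.concatenate([features, theta]))
        theta = theta + delta
    return theta

# Toy usage: a "regressor" that nudges the last 3 dims toward zero.
rng = np.random.default_rng(0)
features = rng.standard_normal(4)
theta0 = rng.standard_normal(3)
toy_regressor = lambda x: -0.5 * x[-3:]  # correction from theta part
theta = iterative_regression(features, toy_regressor, theta0)
print(theta.shape)  # (3,)
```

Each iteration halves the toy estimate, so the final `theta` is closer to zero than `theta0`; in the real framework the regressor is a learned network and `theta` holds pose, shape, and camera parameters.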

  Code [coming soon]

We present an end-to-end framework for recovering a full 3D mesh of a human body from a single RGB image. We use the generative human body model SMPL, which parameterizes the mesh by 3D joint angles and a low-dimensional linear shape space. Estimating a 3D mesh opens the door to a wide range of applications such as foreground and part segmentation and dense correspondences that are beyond what is practical with a simple skeleton. The output mesh can be immediately used by animators, modified, measured, manipulated and retargeted. Our output is also holistic – we always infer the full 3D body even in case of occlusions and truncations.
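As a concrete reference for the parameterization above, SMPL's representation is compact. The sketch below uses dimensions from the public SMPL model (72 pose values, 10 shape coefficients); `random_body` is a hypothetical helper for illustration, not part of the authors' code or the SMPL API.

```python
import numpy as np

# Dimensions follow the public SMPL model; illustrative sketch only.
POSE_DIM = 72    # global rotation + 23 body joints, 3 axis-angle values each
SHAPE_DIM = 10   # coefficients of the low-dimensional linear shape space

def random_body(rng):
    """Draw one hypothetical set of SMPL parameters (pose, shape)."""
    theta = rng.standard_normal(POSE_DIM)   # 3D joint angles
    beta = rng.standard_normal(SHAPE_DIM)   # shape coefficients
    return theta, beta

theta, beta = random_body(np.random.default_rng(0))
print(theta.shape, beta.shape)  # (72,) (10,)
```

Given these parameters, the SMPL model deterministically produces a full mesh, which is what makes the output directly usable for segmentation, correspondence, and retargeting.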

There are several challenges in training such a model in an end-to-end manner:
  1. First is the lack of large-scale ground-truth 3D annotation for in-the-wild images. Existing datasets with accurate 3D annotations are captured in constrained environments (HumanEva, Human3.6M, MPI-INF-3DHP). Models trained on these datasets do not generalize well to the richness of images in the real world.
  2. Second is the inherent ambiguity of the single-view 2D-to-3D mapping: many distinct 3D configurations can explain the same 2D projection, and many of them are not anthropometrically reasonable, with impossible joint angles or extremely skinny bodies. In addition, estimating the camera explicitly introduces a scale ambiguity between the size of the person and the camera distance.
In this work we propose a novel approach to mesh reconstruction that addresses both of these challenges. The key insight is that even though we do not have large-scale paired 2D-to-3D labels for in-the-wild images, we have two large-scale unpaired sources: 2D keypoint annotations of in-the-wild images (LSP, MPII, COCO, etc.) and a separate large-scale dataset of 3D meshes of people with various poses and shapes from MoCap. Our key contribution is to take advantage of these unpaired 2D keypoint annotations and 3D scans in a conditional generative adversarial manner.
The idea is that, given an image, the network has to infer the 3D mesh parameters and the camera such that the 3D keypoints match the annotated 2D keypoints after projection. To deal with the ambiguities, these parameters are sent to a discriminator network whose task is to determine whether the 3D parameters correspond to real human bodies or not. The network is thus encouraged to output parameters on the human manifold, with the discriminator acting as weak supervision: the network implicitly learns the angle limits for each joint and is discouraged from producing unusual body shapes.
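The "match after projection" objective above can be sketched as a visibility-weighted L1 reprojection loss under a weak-perspective camera (a global scale plus a 2D translation, which is the camera model the network estimates). The function names below are illustrative, not the authors' API.

```python
import numpy as np

def project(joints_3d, scale, trans):
    """Weak-perspective projection: drop depth, then apply a global
    scale and a 2D translation (the camera inferred by the network)."""
    return scale * joints_3d[:, :2] + trans

def reprojection_loss(pred_joints_3d, gt_2d, vis, scale, trans):
    """L1 distance between projected 3D joints and annotated 2D
    keypoints, counted only where the keypoint is visible."""
    pred_2d = project(pred_joints_3d, scale, trans)
    return np.sum(vis[:, None] * np.abs(pred_2d - gt_2d))

# Toy check: a perfect prediction has zero loss.
joints = np.array([[0.0, 1.0, 2.0], [1.0, -1.0, 0.5]])
trans = np.array([0.1, 0.2])
gt = project(joints, scale=2.0, trans=trans)
vis = np.ones(2)
loss = reprojection_loss(joints, gt, vis, 2.0, trans)
print(loss)  # 0.0
```

Because only 2D annotations enter this loss, it is exactly the term that lets in-the-wild images supervise the model; the adversarial prior is what resolves the many 3D bodies that achieve the same low reprojection error.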

We take advantage of the structure of the body model and propose a factorized adversarial prior. We show that we can train a model even without any paired 2D-to-3D training data (the pink meshes are all results of this unpaired model); even then, HMR produces reasonable 3D reconstructions. This is exciting because it opens up possibilities for learning 3D from large amounts of 2D data.

Please see the paper for more details.


We thank Naureen Mahmood for providing MoShed datasets and mesh retargeting for character animation, Dushyant Mehta for his assistance on MPI-INF-3DHP, and Shubham Tulsiani, Abhishek Kar, Saurabh Gupta, David Fouhey and Ziwei Liu for helpful discussions. This research was supported in part by BAIR and NSF Award IIS-1526234. This webpage template is taken from humans working on 3D who borrowed it from some colorful folks.