SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians

1Woven by Toyota 2Toyota Motor Europe NV/SA 3Technical University of Munich 4Kyoto University

Teaser figure: Input 2D Image → Predicted Gaussians → Tracked Mesh

SHeaP learns a state-of-the-art, real-time head geometry predictor through self-supervised learning on only 2D videos. The key idea is to use 2D Gaussian Splatting instead of mesh rasterization when computing the photometric reconstruction loss.

Abstract

Accurate, real-time 3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. As 3D ground truth data is hard to come by at scale, previous methods have sought to learn from abundant 2D videos in a self-supervised manner. Typically, this involves the use of differentiable mesh rendering, which is effective but faces limitations. To improve on this, we propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians). Given a source image, we predict a 3DMM mesh and a set of Gaussians that are rigged to this mesh. We then reanimate this rigged head avatar to match a target frame, and backpropagate photometric losses to both the 3DMM and Gaussian prediction networks. We find that using Gaussians for rendering substantially improves the effectiveness of this self-supervised approach. Training solely on 2D data, our method surpasses existing self-supervised approaches in geometric evaluations on the NoW benchmark for neutral faces and a new benchmark for non-neutral expressions. Our method also produces highly expressive meshes, outperforming state-of-the-art in emotion classification.

Video

Method Overview

At each training step, we sample a source image \( I_\textit{source} \) and a target image \( I_\textit{target} \). Both are passed through the same vision transformer (ViT), which predicts 3DMM parameters, namely shape \( \bm{\beta} \), pose \( \bm{\theta} \), and expression \( \bm{\psi} \), plus an environment lighting latent \( \bm{\ell} \) and identity features \( \bm{f} \). A Gaussians Regressor takes \( \bm{f} \) as input, along with DINOv2 features \( \mathbf{d} \), and predicts a set of Gaussians \( \mathcal{G} \), which are bound to the predicted 3DMM mesh and rendered with 2DGS to produce \( \hat{I}_\textit{target} \). Finally, photometric losses between \( \hat{I}_\textit{target} \) and \( I_\textit{target} \), together with additional losses based on rendered depth, normals, and landmarks, are backpropagated to the ViT and Gaussians Regressor parameters.
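The key mechanical step above is binding the predicted Gaussians to the 3DMM mesh so they move with it when the avatar is reanimated. A minimal sketch of one such rigging scheme is below; the specific parameterization (barycentric coordinates on a parent triangle plus a signed offset along its normal) is our illustrative assumption, not necessarily the exact binding used in the paper.

```python
import numpy as np

def triangle_normal(v0, v1, v2):
    """Unit normal of a triangle given its three vertices."""
    n = np.cross(v1 - v0, v2 - v0)
    return n / np.linalg.norm(n)

def gaussian_positions(vertices, faces, parent, bary, offset):
    """Place each Gaussian at a barycentric point on its parent triangle,
    displaced along that triangle's normal.

    vertices: (V, 3) mesh vertices for the current (reanimated) pose/expression
    faces:    (F, 3) triangle vertex indices
    parent:   (G,)   parent-triangle index per Gaussian
    bary:     (G, 3) barycentric coordinates per Gaussian (rows sum to 1)
    offset:   (G,)   signed offset along the parent triangle's normal
    """
    tri = vertices[faces[parent]]                 # (G, 3, 3) triangle corners
    base = np.einsum("gi,gij->gj", bary, tri)     # barycentric interpolation
    normals = np.array([triangle_normal(*t) for t in tri])
    return base + offset[:, None] * normals

# Toy example: one triangle, one Gaussian floating above its centroid.
V = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
F = np.array([[0, 1, 2]])
pos = gaussian_positions(V, F, np.array([0]),
                         np.array([[1/3, 1/3, 1/3]]), np.array([0.1]))

# Reanimating the mesh (moving a vertex) drags the bound Gaussian with it.
V2 = V.copy()
V2[2, 1] = 2.0
pos2 = gaussian_positions(V2, F, np.array([0]),
                          np.array([[1/3, 1/3, 1/3]]), np.array([0.1]))
```

Because the Gaussians are expressed in mesh-local coordinates, photometric gradients from the 2DGS render flow back through the triangle vertices into the 3DMM parameter predictions, which is what lets the rendering loss supervise the geometry.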

Interactive Results

Click and drag to rotate the 3D mesh. Use the scroll wheel to zoom in and out.

Comparisons Versus Baselines

DECA

EMOCA

SMIRK

SHeaP (ours)

Input

Compared to state-of-the-art methods, our approach reconstructs more accurate head geometry with better temporal stability. It also produces less exaggerated expressions than EMOCA or SMIRK, and accurately models the neck joint.

BibTeX

@article{schoneveld2025sheap,
  title={SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians},
  author={Schoneveld, Liam and Chen, Zhe and Davoli, Davide and Tang, Jiapeng and Terazawa, Saimon and Nishino, Ko and Nie{\ss}ner, Matthias},
  journal={arXiv preprint},
  year={2025}
}