Zero-Shot Human Pose Estimation Using Diffusion-Based Inverse Solvers

Problem Statement

Pose estimation refers to tracking a human's full-body posture, including the head, torso, arms, and legs. The problem is challenging in practical settings where the number of body sensors is limited. Past work has shown promising results using conditional diffusion models, where the pose prediction is conditioned on the <location, rotation> measurements from the sensors. Unfortunately, nearly all these approaches generalize poorly across users, primarily because location measurements are highly influenced by the body size of the user. In this paper, we formulate pose estimation as an inverse problem and design an algorithm capable of zero-shot generalization. Our idea utilizes a pre-trained diffusion model conditioned on rotational measurements alone; the priors from this model are then guided by a likelihood term derived from the measured locations. Thus, given any user, our proposed InPose method generatively estimates a highly likely sequence of poses that best explains the sparse on-body measurements.
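The likelihood-guided sampling described above can be sketched in a few lines. This is a minimal, illustrative posterior-guidance step, not the paper's implementation: `denoise_fn` (the pre-trained, rotation-conditioned denoiser) and `fk_fn` (forward kinematics mapping a pose vector to the sensed joint locations) are hypothetical stand-ins, and a finite-difference gradient substitutes for the automatic differentiation a real solver would use.

```python
import numpy as np

def guided_denoise_step(x_t, sigma, denoise_fn, measured_locs, fk_fn,
                        guidance_weight=1.0, eps=1e-3):
    """One likelihood-guided denoising step (illustrative sketch).

    x_t           -- current noisy pose vector
    denoise_fn    -- pre-trained denoiser conditioned on rotations (hypothetical)
    measured_locs -- sparse location measurements from the on-body sensors
    fk_fn         -- forward kinematics: pose vector -> sensed joint locations
    """
    x0_hat = denoise_fn(x_t, sigma)  # the prior's pose estimate
    base_err = np.sum((measured_locs - fk_fn(x0_hat)) ** 2)
    # Finite-difference gradient of the location likelihood w.r.t. x_t.
    # (A real implementation would backpropagate through denoise_fn and fk_fn.)
    grad = np.zeros_like(x_t)
    for i in range(x_t.size):
        d = np.zeros_like(x_t)
        d[i] = eps
        err = np.sum((measured_locs - fk_fn(denoise_fn(x_t + d, sigma))) ** 2)
        grad[i] = (err - base_err) / eps
    # Pull the prior's estimate toward poses that explain the measured locations.
    return x0_hat - guidance_weight * grad
```

The prior proposes a pose, and the likelihood gradient nudges it toward agreement with the measured locations; the location data never enters the denoiser itself.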

Method





We formulate full-body 3D pose estimation from sparse sensors as an inverse problem, for which we learn the prior through a diffusion model.

Our core observation is that any human's full-body pose can be decomposed into a ''scale-free pose'' and a scale-dependent component. The scale-free pose can be imagined as a template human body whose skeletal joints (e.g., shoulders, elbows, hips, knees) are rotated appropriately to create a given pose. The scale-dependent component is the location of the joints in 3D space. Forward kinematics relates the scale-free pose, together with the body size, to the scale-dependent component. Since the sensors give <location, rotation> measurements from 3 body joints, it is possible to estimate a distribution of scale-free poses from the rotational measurements alone. The location measurements can then be used to sharpen this distribution toward poses that jointly explain both the rotation and location measurements. This decomposition lends itself to an inverse problem formulation, shown visually in the Figure above. Using <location, rotation> measurements from 3 body joints---the head and the two wrists---InPose aims to track the locations of all 22 body joints, which are necessary to fully define the 3D pose of a human.
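The forward-kinematics relation above can be illustrated with a toy kinematic chain. The sketch below uses planar (z-axis) rotations and a single chain rather than the actual 22-joint skeleton, so the function names and structure are ours, not the paper's:

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def forward_kinematics(joint_angles, bone_lengths, root=None):
    """Map a scale-free pose (per-joint rotations) plus body size (bone
    lengths) to joint locations in 3D, for a single kinematic chain."""
    positions = [np.zeros(3) if root is None else np.asarray(root, float)]
    R = np.eye(3)
    for theta, length in zip(joint_angles, bone_lengths):
        R = R @ rot_z(theta)  # accumulate rotation down the chain
        positions.append(positions[-1] + R @ np.array([length, 0.0, 0.0]))
    return np.array(positions)
```

Scaling every bone length by a common factor scales every joint location by that same factor while the joint angles (the scale-free pose) are untouched, which is exactly the decomposition described above.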

Results


Left is ground truth, right is InPose.

Each of the above samples shows pose tracking results over a few-second segment. Errors are shown using colors (more red indicates higher error). InPose infers lower-body movement using the prior learnt by the diffusion model.


Note the catastrophic failure in Sample #4, where the algorithm fails to identify that the user is lying on the ground. This is alleviated in the samples below by modifying the algorithm.







Left is ground truth, right is InPose.

We address catastrophic failures by supplying the head position to the classifier-free guidance (CFG). The modified algorithm improves the lower-body pose estimates over the previous results (compare Sample #1 here with Sample #4 above).
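For reference, classifier-free guidance blends conditional and unconditional denoiser outputs; the fix above amounts to packing the head position into the condition alongside the rotations. A minimal sketch, with a hypothetical `denoise_fn` signature:

```python
import numpy as np

def cfg_denoise(x_t, sigma, denoise_fn, condition, guidance_scale=2.0):
    """Classifier-free guidance: blend the conditional and unconditional
    denoiser outputs. `condition` packs the rotation measurements together
    with the head position, per the modification described above."""
    uncond = denoise_fn(x_t, sigma, None)     # condition dropped
    cond = denoise_fn(x_t, sigma, condition)  # rotations + head position
    return uncond + guidance_scale * (cond - uncond)
```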







Left is ground truth, right is InPose.

Outlier body shapes can also be handled by the inverse guidance. This robustness allows training data from any subject to be utilized.



Qualitative Comparison with Baselines



The above figures present qualitative comparisons between InPose and the baselines for two scaling factors, 60% and 140%. Unlike the baselines, InPose generalizes to varied body scales due to the inverse formulation underlying our proposed algorithm. The baselines perform better for the default size, especially in the lower body, but degrade when asked to generalize. The errors are especially prominent in the 60% case, where the baselines predict the lower body to be in a squatted pose because the measurements are generated by a user of short stature. Since the priors were learnt by BoDiffusion on data generated by a user of the default body shape, there is no way to inform the model of this difference. For the 140% case too, the baselines incur higher torso error.

Quantitative Comparisons with Baselines



(a) Position error vs. body scale.


(b) Rotation error vs. body scale.


(c) Position error vs. location noise.

Fig. (a,b) present results when the models are trained on a default body size and then tested on various body sizes (including the default). The body sizes are varied by changing the scaling factor on the X axis (a value greater than 1.0 indicates a proportionally taller human, and vice versa). Note that all bones of the taller (or shorter) human have been scaled up (or down) by the same factor. The root joint translation is also proportional to this scaling factor.
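The test-subject construction described above amounts to a uniform scaling of the skeleton and the root trajectory; a sketch (the function name is ours):

```python
import numpy as np

def scale_subject(bone_lengths, root_trajectory, scale):
    """Scale every bone and the root-joint translation by the same factor
    (scale > 1.0 gives a proportionally taller human, and vice versa)."""
    return scale * np.asarray(bone_lengths), scale * np.asarray(root_trajectory)
```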

As expected, the baselines are able to outperform InPose in the default case when scale equals 1.0. This is because they are trained for this default shape. However, both the scaled MPJPE and the MPJRE (Fig. (a,b)) remain almost flat for InPose regardless of body size. This demonstrates the zero-shot nature of our inverse solver in contrast to the significant degradation of the baselines.
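For clarity, MPJPE (mean per-joint position error) averages the Euclidean distance between predicted and ground-truth joint locations over all frames and joints. Below is the standard definition; the "scaled" variant reported above may additionally normalize by body scale, which we do not show:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance between
    predicted and ground-truth joint locations.
    pred, gt: arrays of shape (frames, joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()
```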

InPose is designed to be implicitly robust to location measurement noise as well. We inject zero-mean i.i.d. Gaussian noise into the input location streams and compute the estimation errors, while maintaining the default body shape and the rotation measurements. This is an important test for practical applications since real-world wearable sensors---like watches and phones---suffer from measurement errors. Fig. (c) shows the location error under increasing Gaussian noise variance. Evidently, InPose stays flat while the baselines degrade with noise. This is expected: the baseline models are sensitive to location noise, whereas InPose uses the locations only for inverse guidance, allowing the prior to play an important role in the final pose estimates.
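The noise-injection protocol above is straightforward to reproduce; the array shape here is illustrative:

```python
import numpy as np

def add_location_noise(locations, noise_var, rng=None):
    """Inject zero-mean i.i.d. Gaussian noise into a location stream.
    locations: array of shape (frames, joints, 3)."""
    rng = np.random.default_rng(rng)
    return locations + rng.normal(0.0, np.sqrt(noise_var), size=locations.shape)
```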