Zero-Shot Human Pose Estimation Using Diffusion-Based Inverse Solvers

Problem Statement

Pose estimation refers to tracking a human's full-body posture, including the head, torso, arms, and legs. The problem is challenging in practical settings where the number of body sensors is limited. Past work has shown promising results using conditional diffusion models, where the pose prediction is conditioned on the <location, rotation> measurements from the sensors. Unfortunately, nearly all these approaches generalize poorly across users, primarily because location measurements are highly influenced by the body size of the user. In this paper, we formulate pose estimation as an inverse problem and design an algorithm capable of zero-shot generalization. Our idea utilizes a pre-trained diffusion model conditioned on rotational measurements alone; the priors from this model are then guided by a likelihood term derived from the measured locations. Thus, given any user, our proposed InPose method generatively estimates a highly likely sequence of poses that best explains the sparse on-body measurements.
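The likelihood-guided sampling described above can be sketched in a few lines. This is a minimal, illustrative posterior-guidance step, not the paper's implementation: `denoise_fn` (the pre-trained, rotation-conditioned denoiser) and `fk_fn` (forward kinematics mapping a pose vector to the sensed joint locations) are hypothetical stand-ins, and a finite-difference gradient substitutes for the automatic differentiation a real solver would use.

```python
import numpy as np

def guided_denoise_step(x_t, sigma, denoise_fn, measured_locs, fk_fn,
                        guidance_weight=1.0, eps=1e-3):
    """One likelihood-guided denoising step (illustrative sketch).

    x_t           -- current noisy pose vector
    denoise_fn    -- pre-trained denoiser conditioned on rotations (hypothetical)
    measured_locs -- sparse location measurements from the on-body sensors
    fk_fn         -- forward kinematics: pose vector -> sensed joint locations
    """
    x0_hat = denoise_fn(x_t, sigma)  # the prior's pose estimate
    base_err = np.sum((measured_locs - fk_fn(x0_hat)) ** 2)
    # Finite-difference gradient of the location likelihood w.r.t. x_t.
    # (A real implementation would backpropagate through denoise_fn and fk_fn.)
    grad = np.zeros_like(x_t)
    for i in range(x_t.size):
        d = np.zeros_like(x_t)
        d[i] = eps
        err = np.sum((measured_locs - fk_fn(denoise_fn(x_t + d, sigma))) ** 2)
        grad[i] = (err - base_err) / eps
    # Pull the prior's estimate toward poses that explain the measured locations.
    return x0_hat - guidance_weight * grad
```

The prior proposes a pose, and the likelihood gradient nudges it toward agreement with the measured locations; the location data never enters the denoiser itself.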

Method





We formulate full-body 3D pose estimation from sparse sensors as an inverse problem, for which we learn the prior through a diffusion model.

Our core observation is that any human's full-body pose can be decomposed into a ''scale-free pose'' and a scale-dependent component. The scale-free pose can be imagined as a template human body whose skeletal joints (e.g., shoulders, elbows, hips, knees) are rotated appropriately to create a given pose. The scale-dependent component is the location of the joints in 3D space. Forward kinematics relates the scale-free pose, together with the body size, to the scale-dependent component. Since the sensors give <location, rotation> measurements from 3 body joints, it is possible to estimate a distribution of scale-free poses from the rotational measurements alone. The location measurements can then be used to sharpen this distribution toward poses that jointly explain both the rotation and location measurements. This decomposition lends itself to an inverse problem formulation, shown visually in the Figure above. Using <location, rotation> measurements from 3 body joints---the head and the two wrists---InPose aims to track the locations of all 22 body joints, which are necessary to fully define the 3D pose of a human.
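The forward-kinematics relation above can be illustrated with a toy kinematic chain. The sketch below uses planar (z-axis) rotations and a single chain rather than the actual 22-joint skeleton, so the function names and structure are ours, not the paper's:

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def forward_kinematics(joint_angles, bone_lengths, root=None):
    """Map a scale-free pose (per-joint rotations) plus body size (bone
    lengths) to joint locations in 3D, for a single kinematic chain."""
    positions = [np.zeros(3) if root is None else np.asarray(root, float)]
    R = np.eye(3)
    for theta, length in zip(joint_angles, bone_lengths):
        R = R @ rot_z(theta)  # accumulate rotation down the chain
        positions.append(positions[-1] + R @ np.array([length, 0.0, 0.0]))
    return np.array(positions)
```

Scaling every bone length by a common factor scales every joint location by that same factor while the joint angles (the scale-free pose) are untouched, which is exactly the decomposition described above.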

Results


Left is ground truth, right is InPose.

Each of the above samples shows pose tracking results over a few-second segment. Errors are shown using colors (more red indicates higher error). InPose infers lower-body movement using the prior learnt by the diffusion model.


Note the catastrophic failure in Sample #4, where the algorithm fails to identify that the user is lying on the ground. This is alleviated in the samples below by modifying the algorithm.







Left is ground truth, right is InPose.

We address catastrophic failures by supplying the head position to the classifier-free guidance (CFG). The modified algorithm improves the lower-body pose estimates over the previous results (compare Sample #1 here with Sample #4 above).
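For reference, classifier-free guidance blends conditional and unconditional denoiser outputs; the fix above amounts to packing the head position into the condition alongside the rotations. A minimal sketch, with a hypothetical `denoise_fn` signature:

```python
import numpy as np

def cfg_denoise(x_t, sigma, denoise_fn, condition, guidance_scale=2.0):
    """Classifier-free guidance: blend the conditional and unconditional
    denoiser outputs. `condition` packs the rotation measurements together
    with the head position, per the modification described above."""
    uncond = denoise_fn(x_t, sigma, None)     # condition dropped
    cond = denoise_fn(x_t, sigma, condition)  # rotations + head position
    return uncond + guidance_scale * (cond - uncond)
```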







Left is ground truth, right is InPose.

Outlier body shapes can also be handled by the inverse guidance. This robustness allows training data from any subject to be utilized.



Qualitative Comparison with Baselines



The above figures present qualitative comparisons between InPose and the baselines for two scaling factors, 60% and 140%. Unlike the baselines, InPose generalizes to varied body scales due to the inverse formulation underlying our proposed algorithm. The baselines perform better for the default size, especially in the lower body, but degrade when asked to generalize. The errors are especially prominent in the 60% case, where the baselines predict the lower body to be in a squatted pose because the measurements are generated by a user of short stature. Since the priors were learnt by BoDiffusion on data generated by a user of the default body shape, there is no way to inform the model of this difference. For the 140% case too, the baselines incur higher torso error.

Quantitative Comparisons with Baselines



(a) Position error vs. body scale.


(b) Rotation error vs. body scale.


(c) Position error vs. location noise.

Fig. (a,b) present results when the models are trained on a default body size and then tested on various body sizes (including the default). The body sizes are varied by changing the scaling factor on the X axis (a value greater than 1.0 indicates a proportionally taller human, and vice versa). Note that all bones of the taller (or shorter) human have been scaled up (or down) by the same factor. The root joint translation is also proportional to this scaling factor.
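The test-subject construction described above amounts to a uniform scaling of the skeleton and the root trajectory; a sketch (the function name is ours):

```python
import numpy as np

def scale_subject(bone_lengths, root_trajectory, scale):
    """Scale every bone and the root-joint translation by the same factor
    (scale > 1.0 gives a proportionally taller human, and vice versa)."""
    return scale * np.asarray(bone_lengths), scale * np.asarray(root_trajectory)
```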

As expected, the baselines are able to outperform InPose in the default case when scale equals 1.0. This is because they are trained for this default shape. However, both the scaled MPJPE and the MPJRE (Fig. (a,b)) remain almost flat for InPose regardless of body size. This demonstrates the zero-shot nature of our inverse solver in contrast to the significant degradation of the baselines.
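For clarity, MPJPE (mean per-joint position error) averages the Euclidean distance between predicted and ground-truth joint locations over all frames and joints. Below is the standard definition; the "scaled" variant reported above may additionally normalize by body scale, which we do not show:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance between
    predicted and ground-truth joint locations.
    pred, gt: arrays of shape (frames, joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()
```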

InPose is designed to be implicitly robust to location measurement noise as well. We inject zero-mean i.i.d. Gaussian noise into the input location streams and compute the estimation errors, while maintaining the default body shape and the rotation measurements. This is an important test for practical applications since real-world wearable sensors---like watches and phones---suffer from measurement errors. Fig. (c) shows the location error under increasing Gaussian noise variance. Evidently, InPose stays flat while the baselines degrade with noise. This is expected: the baseline models are sensitive to location noise, whereas InPose uses the locations only for inverse guidance, allowing the prior to play an important role in the final pose estimates.
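The noise-injection protocol above is straightforward to reproduce; the array shape here is illustrative:

```python
import numpy as np

def add_location_noise(locations, noise_var, rng=None):
    """Inject zero-mean i.i.d. Gaussian noise into a location stream.
    locations: array of shape (frames, joints, 3)."""
    rng = np.random.default_rng(rng)
    return locations + rng.normal(0.0, np.sqrt(noise_var), size=locations.shape)
```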