Improved Motion Generation with Enhanced Compression: MotionLCM-V2

2024-12-11

This article introduces MotionLCM-V2, a significant advancement in text-to-motion generation. It surpasses the state of the art (SOTA) in three key aspects: motion generation quality, motion-text alignment capability, and inference speed. The article also presents MLD++, a substantial improvement over the original MLD for text-to-motion generation.

What is MotionLCM?

MotionLCM accelerates inference by leveraging latent consistency distillation from a teacher model, MLD. MLD’s generation capability ultimately limits MotionLCM’s effectiveness. Therefore, enhancing MLD’s generation performance is crucial for improving MotionLCM.
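
For intuition, the sketch below shows one latent consistency distillation step in PyTorch. All interfaces are assumptions for illustration (the denoiser signature `(z_t, t, text_emb)`, the `alphas_cumprod` noise schedule, the fixed `skip`), not the MotionLCM codebase, and it omits details such as classifier-free guidance and the consistency boundary-condition parameterization.

```python
import torch
import torch.nn.functional as F

def lcd_step(student, student_ema, teacher, z0, text_emb,
             alphas_cumprod, optimizer, skip=20):
    """One latent consistency distillation step on a batch of VAE latents z0.

    z0: (batch, n_tokens, latent_dim). teacher/student map
    (z_t, t, text_emb) -> prediction; these signatures are assumed.
    """
    T = alphas_cumprod.shape[0]
    b = z0.shape[0]
    # Sample a timestep pair (t, s) with a fixed skip along the trajectory.
    t = torch.randint(skip, T, (b,), device=z0.device)
    s = t - skip
    noise = torch.randn_like(z0)
    a_t = alphas_cumprod[t].view(b, 1, 1)
    z_t = a_t.sqrt() * z0 + (1 - a_t).sqrt() * noise  # forward diffusion q(z_t | z0)

    with torch.no_grad():
        # Teacher takes one deterministic DDIM step from t down to s.
        eps = teacher(z_t, t, text_emb)
        z0_pred = (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        a_s = alphas_cumprod[s].view(b, 1, 1)
        z_s = a_s.sqrt() * z0_pred + (1 - a_s).sqrt() * eps
        # Consistency target: the EMA student's output at the earlier timestep.
        target = student_ema(z_s, s, text_emb)

    # Self-consistency: the student at t must match the EMA target at s.
    loss = F.mse_loss(student(z_t, t, text_emb), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the student learns to map any point on the trajectory directly toward the clean latent, inference collapses to a handful of steps; but the quality ceiling is set by the teacher that defines the trajectory, which is why MLD matters so much.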

Addressing MLD's Limitations

The authors explored two key areas to improve MLD's generation performance:

1. Eliminating Structural Defects: The authors identified two structural flaws in the original MLD's denoising transformer.
The VAE latent tokens entered the network without passing through a learnable linear layer, hindering their integration with the model.
A ReLU activation applied to the text feature suppressed its negative components, discarding valuable information.

The authors introduced two operations to rectify these issues:

Op1: A trainable linear layer applied to the VAE latent tokens for better signal modulation.
Op2: Removal of the ReLU activation to preserve the negative components of the text feature.

These modifications significantly improved both motion generation quality and text alignment capability.
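
As a concrete illustration, the following minimal PyTorch sketch shows where Op1 and Op2 sit at the input of the denoising transformer. Shapes, dimensions, and module names are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class DenoiserInput(nn.Module):
    """Input stage of a denoising transformer with Op1 and Op2 applied."""

    def __init__(self, latent_dim: int, text_dim: int, model_dim: int):
        super().__init__()
        # Op1: a trainable linear layer on the VAE latent tokens, so the
        # denoiser can modulate the latent signal instead of consuming it raw.
        self.latent_proj = nn.Linear(latent_dim, model_dim)
        # Op2: project the text feature WITHOUT a ReLU, preserving its
        # negative components (the original MLD applied a ReLU here).
        self.text_proj = nn.Linear(text_dim, model_dim)

    def forward(self, z: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # z: (batch, n_tokens, latent_dim); text_feat: (batch, text_dim)
        latent_tokens = self.latent_proj(z)                  # Op1
        text_token = self.text_proj(text_feat).unsqueeze(1)  # Op2: no ReLU
        # Concatenate the text token with the latent tokens for the transformer.
        return torch.cat([text_token, latent_tokens], dim=1)
```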

2. Enabling Multi-Latent-Token Learning:

Motion latent diffusion relies on the VAE's compression of motion sequences into a compact latent space in which denoising can succeed. The original MLD limited itself to single-latent-token learning because the compression rate became uncontrollable when multiple latent tokens were used, which hindered its ability to generate high-quality motions.

Introducing MLD++

The article proposes MLD++, a solution that overcomes the limitations of single-latent-token learning. MLD++ incorporates a latent adapter, a linear layer that adapts the dimension of the embedded distribution parameters to directly control the size of the latent space. This elegant design allows MLD++ to leverage the power of multiple latent tokens while maintaining control over the compression rate, resulting in a more compact latent space for successful diffusion.
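
To make the idea concrete, here is a minimal sketch of such a latent adapter, assuming distribution parameters of shape (batch, n_latent_tokens, embed_dim). The module and method names are hypothetical, and the paired up-projection back to the decoder's width is my assumption about how the adapter fits into the VAE; the source only specifies a linear layer that resizes the embedded distribution parameters.

```python
import torch
import torch.nn as nn

class LatentAdapter(nn.Module):
    """Linear adapter that fixes the diffusion latent dimension directly."""

    def __init__(self, embed_dim: int, latent_dim: int):
        super().__init__()
        # Down-project the encoder's distribution parameters to latent_dim,
        # so the latent size is a chosen hyperparameter, not a side effect
        # of the token count.
        self.down = nn.Linear(embed_dim, latent_dim)
        # Assumed counterpart: up-project back before the decoder.
        self.up = nn.Linear(latent_dim, embed_dim)

    def compress(self, params: torch.Tensor) -> torch.Tensor:
        # params: (batch, n_latent_tokens, embed_dim) -> compact latents
        return self.down(params)

    def expand(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, n_latent_tokens, latent_dim) -> decoder-sized tokens
        return self.up(z)
```

With the latent dimension exposed as a hyperparameter, the total latent size (n_latent_tokens x latent_dim), and hence the compression rate, stays under direct control no matter how many tokens are used.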

Experiments demonstrate that MLD++ surpasses the prior approach, B2A-HDM, which utilizes a complex multi-denoiser framework. MLD++ achieves superior performance with a single denoiser due to its effective control over the compression rate.

MotionLCM-V2: Distillation and Advancement

By leveraging the superior MLD++ as the teacher model, MotionLCM-V2 achieves significant improvements in distillation performance compared to MotionLCM-V1. This translates to advancements in all three key aspects of text-to-motion generation: inference speed, motion generation quality, and text alignment capability.

Overall, MotionLCM-V2 establishes itself as a new SOTA in text-to-motion generation, paving the way for even more expressive and high-fidelity motion creation through text descriptions.
