MoMADiff

Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion

State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Hangzhou Innovation Institute, Beihang University
ACM MM 2025


Abstract

Generating 3D human motion from text descriptions remains challenging due to the diverse and complex nature of human motion. While existing methods excel within the training distribution, they often struggle with out-of-distribution motions, limiting their applicability in real-world scenarios. VQVAE-based methods, in particular, often fail to represent novel motions faithfully with discrete tokens, which hampers their ability to generalize beyond seen data, while diffusion-based methods operating on continuous representations often lack fine-grained control over individual frames. To address these challenges, we propose MoMADiff, a robust motion generation framework that combines masked modeling with diffusion processes to generate motion using frame-level continuous representations. Our model supports flexible user-provided keyframe specification, enabling precise control over both the spatial and temporal aspects of motion synthesis. MoMADiff demonstrates strong generalization capability on novel text-to-motion datasets with sparse keyframes as motion prompts. Extensive experiments on two held-out datasets and two standard benchmarks show that our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and keyframe adherence. The code is available at: https://github.com/zzysteve/MoMADiff.

Motivation

Existing methods struggle with out-of-distribution motions, exhibiting two main failure modes: inaccurate motion reconstruction when novel motions are quantized into discrete tokens, and poor adherence to keyframe guidance.

Overview

Overview of our method

Our approach combines a Motion VAE for high-fidelity reconstruction with a masked autoregressive diffusion model for generation. This design enables realistic motion synthesis and allows for precise control via sparse keyframes.
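
To make the generation process concrete, below is a minimal, hypothetical PyTorch sketch of masked autoregressive generation over frame-level continuous latents, with user-provided keyframes held fixed. The class name MaskedLatentGenerator, the additive text conditioning, the left-to-right reveal schedule, and the MLP that stands in for a per-frame diffusion head are illustrative assumptions rather than the actual MoMADiff implementation; a real pipeline would encode and decode latents with the Motion VAE and denoise each predicted latent with a diffusion sampler.

# Minimal sketch of masked autoregressive generation over frame-level
# continuous latents with keyframe conditioning. Module names, sizes, the
# additive text conditioning, the left-to-right reveal schedule, and the
# MLP standing in for the per-frame diffusion head are illustrative
# assumptions, not the paper's implementation.
import math
import torch
import torch.nn as nn

class MaskedLatentGenerator(nn.Module):
    def __init__(self, latent_dim=16, model_dim=256, n_layers=4, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, model_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, model_dim))
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=n_heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Stand-in for the per-frame diffusion head that denoises each
        # continuous latent (a real head would run a diffusion sampler).
        self.head = nn.Sequential(nn.Linear(model_dim, model_dim), nn.GELU(),
                                  nn.Linear(model_dim, latent_dim))

    @torch.no_grad()
    def generate(self, keyframe_latents, keyframe_mask, text_emb, steps=8):
        """Iteratively fill in masked frame latents; keyframes stay fixed.

        keyframe_latents: (B, T, latent_dim), zeros at non-keyframe slots
        keyframe_mask:    (B, T) bool, True where the user supplied a keyframe
        text_emb:         (B, model_dim) pooled text conditioning
        """
        B, T, _ = keyframe_latents.shape
        latents = keyframe_latents.clone()
        known = keyframe_mask.clone()
        for step in range(steps):
            x = self.in_proj(latents)
            # Unknown frames are represented by a learned mask token.
            x = torch.where(known.unsqueeze(-1), x,
                            self.mask_token.expand(B, T, -1))
            x = x + text_emb.unsqueeze(1)       # simple additive conditioning
            pred = self.head(self.backbone(x))  # predicted clean latents
            # Reveal a growing fraction of frames at every iteration.
            target = min(T, math.ceil(T * (step + 1) / steps))
            for b in range(B):
                need = target - int(known[b].sum())
                if need > 0:
                    # Left-to-right reveal for simplicity; a confidence-based
                    # order is a common alternative in masked generation.
                    idx = torch.nonzero(~known[b]).squeeze(-1)[:need]
                    latents[b, idx] = pred[b, idx]
                    known[b, idx] = True
        return latents  # decode to motion with the Motion VAE decoder

# Example: 60 frames, keyframes given at frames 0, 30, and 59.
model = MaskedLatentGenerator()
lat = torch.zeros(1, 60, 16)
mask = torch.zeros(1, 60, dtype=torch.bool)
mask[:, [0, 30, 59]] = True
lat[mask] = torch.randn(3, 16)   # user-provided keyframe latents
text = torch.randn(1, 256)       # pooled text embedding
out = model.generate(lat, mask, text)
print(out.shape)  # torch.Size([1, 60, 16])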

Qualitative Results

Qualitative results on two standard benchmarks. Our method outperforms existing state-of-the-art approaches on both HumanML3D and KIT-ML in terms of motion quality and text-motion fidelity.

Quantitative Results

Quantitative comparison results. Our method demonstrates strong generalization capability on out-of-distribution motions with sparse keyframes as motion prompts, significantly outperforming existing methods.

BibTeX

@article{zhang2025towards,
  title={Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion},
  author={Zhang, Zongye and Kong, Bohan and Liu, Qingjie and Wang, Yunhong},
  journal={arXiv preprint arXiv:2505.11013},
  year={2025}
}