ICML 2026

Semantic-Aware Motion Encoding for Topology-Agnostic Character Animation

State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Topology-agnostic motion encoding · Zero-shot cross-species retargeting · Human and animal motion generation

Abstract

Generalizing motion representation across diverse characters remains challenging due to significant topological variation in skeletal structures across datasets and species, which hinders the development of scalable generative models. To bridge this gap, we propose a Semantic-Aware Topology-Agnostic (SATA) framework that learns a unified latent manifold shared by disparate species. Unlike methods relying on fixed hierarchies or rigid padding strategies, our approach leverages a semantic modulation mechanism to align functional joint correspondences, thereby decoupling motion from topology. This design enables the construction of a continuous, generation-friendly motion space from large-scale, unaligned raw BVH data. Experiments on human and animal datasets demonstrate that our framework achieves high-fidelity reconstruction and supports downstream text-to-motion tasks. Notably, the model enables zero-shot cross-species retargeting without paired data.

Motivation

[Figure: Zero-shot cross-species retargeting teaser]

Existing motion models often assume a canonical skeleton or rely on rigid padding. SATA instead injects functional joint semantics into a topology-agnostic encoder, allowing the same motion representation to drive disparate skeletal structures.

Overview

[Figure: Overview of the SATA motion autoencoder]

The framework combines Semantic-Aware Feature Modulation with Spatio-Temporal Interleaved Graph Blocks. A VAE or RVQ-VAE bottleneck regularizes the latent motion manifold, while target semantic and structural priors guide decoding onto arbitrary skeletons.

Qualitative Results

Zero-Shot Cross-Species Retargeting

[Figure: Zero-shot cross-species retargeting comparison]

Zero-shot retargeting comparison across diverse species. SATA maintains structural stability and preserves nuanced motion semantics in cross-topology scenarios, while the baseline can suffer from distortion and motion loss.

In-Domain Human Retargeting

[Figure: Human retargeting qualitative results]

Qualitative retargeting results on AT-HumanML3D. Given a single source motion, SATA decodes it to multiple target characters with distinct body proportions and skeletal scales.

Text-to-Motion Generation

[Figure: Text-to-motion qualitative results]

Text-to-motion generation for human and animal subjects. Both continuous VAE and discrete RVQ variants generate motions aligned with textual descriptions.

Core Quantitative Results

A compact view of the metrics that best communicate the main claims: zero-shot cross-dataset robustness, retargeting fidelity, and text-to-motion learnability. For cross-topology transfer, we highlight the global trajectory (RT) and joint-position (JP) errors because they directly reflect whether the decoded motion remains structurally stable. Lower is better for all metrics except Top-3.

Human-to-Animal JP: 398.14 -> 34.59 (SAME vs. Ours, zero-shot transfer)
Cross Retargeting: 0.960 -> 0.197 (SAME vs. Ours (VAE) on AT-HumanML3D)
Text-to-Motion FID: 1.628 -> 0.158 (SAME vs. Ours (RVQ) on AT-HumanML3D)

Zero-Shot Cross-Dataset Generalization

Setting                      Method   RT        JP        Takeaway
Train Human -> Test Animal   SAME     191.46    398.14    Severe cross-topology drift
Train Human -> Test Animal   Ours     15.593    34.585    Keeps target structure stable
Train Animal -> Test Human   SAME     110.51    122.69    Struggles with human topology
Train Animal -> Test Human   Ours     57.228    80.908    Better RT/JP fidelity

Retargeting on AT-HumanML3D

Method       Internal   Cross
MoMask       89.421     103.72
SAN          15.9647    34.8207
SAME         1.4842     0.9604
Ours (RVQ)   1.1238     0.9669
Ours (VAE)   0.2122     0.1974

Text-to-Motion on AT-HumanML3D

Method       FID     MMD     Top-3
Upper bound (ground truth)
GT           0.000   2.901   0.808
Text-to-motion generation
SAME         1.628   4.661   0.554
Ours (VAE)   1.226   4.576   0.563
Ours (RVQ)   0.158   3.955   0.639