ICML 2026

Semantic-Aware Motion Encoding for Topology-Agnostic Character Animation

State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Topology-agnostic motion encoding · Zero-shot cross-species retargeting · Human and animal motion generation

Abstract

Generalizing motion representation across diverse characters remains challenging due to significant topological variation in skeletal structures across datasets and species, which hinders the development of scalable generative models. To bridge this gap, we propose a Semantic-Aware Topology-Agnostic (SATA) framework that learns a unified latent manifold shared by disparate species. Unlike methods relying on fixed hierarchies or rigid padding strategies, our approach leverages a semantic modulation mechanism to align functional joint correspondences, thereby decoupling motion from topology. This design enables the construction of a continuous, generation-friendly motion space from large-scale, unaligned raw BVH data. Experiments on human and animal datasets demonstrate that our framework achieves high-fidelity reconstruction and supports downstream text-to-motion tasks. Notably, the model enables zero-shot cross-species retargeting without paired data.

Motivation

[Figure: Zero-shot cross-species retargeting teaser]

Existing motion models often assume a canonical skeleton or rely on rigid padding. SATA instead injects functional joint semantics into a topology-agnostic encoder, allowing the same motion representation to drive disparate skeletal structures.

Overview

[Figure: Overview of the SATA motion autoencoder]

The framework combines Semantic-Aware Feature Modulation with Spatio-Temporal Interleaved Graph Blocks. A VAE or RVQ-VAE bottleneck regularizes the latent motion manifold, while target semantic and structural priors guide decoding onto arbitrary skeletons.

Qualitative Results

Zero-Shot Cross-Species Retargeting

[Figure: Zero-shot cross-species retargeting comparison]

Zero-shot retargeting comparison across diverse species. SATA maintains structural stability and preserves nuanced motion semantics in cross-topology scenarios, while the baseline can suffer from distortion and motion loss.

In-Domain Human Retargeting

[Figure: Human retargeting qualitative results]

Qualitative retargeting results on AT-HumanML3D. Given a single source motion, SATA decodes it to multiple target characters with distinct body proportions and skeletal scales.

Text-to-Motion Generation

[Figure: Text-to-motion qualitative results]

Text-to-motion generation for human and animal subjects. Both continuous VAE and discrete RVQ variants generate motions aligned with textual descriptions.

Core Quantitative Results

A compact view of the metrics that best communicate the main claims: zero-shot cross-dataset robustness, retargeting fidelity, and text-to-motion learnability. For cross-topology transfer, we highlight the global trajectory (RT) and joint-position (JP) errors because they directly reflect whether the decoded motion remains structurally stable. Lower is better for all metrics except Top-3.

Human-to-Animal JP: 398.14 -> 34.59 (SAME vs. Ours, zero-shot transfer)
Cross Retargeting: 0.960 -> 0.197 (SAME vs. Ours (VAE) on AT-HumanML3D)
Text-to-Motion FID: 1.628 -> 0.158 (SAME vs. Ours (RVQ) on AT-HumanML3D)

Zero-Shot Cross-Dataset Generalization

Setting                      Method   RT        JP        Takeaway
Train Human -> Test Animal   SAME     191.46    398.14    Severe cross-topology drift
Train Human -> Test Animal   Ours     15.593    34.585    Keeps target structure stable
Train Animal -> Test Human   SAME     110.51    122.69    Struggles with human topology
Train Animal -> Test Human   Ours     57.228    80.908    Better RT/JP fidelity

Retargeting on AT-HumanML3D

Method       Internal   Cross
MoMask       89.421     103.72
SAN          15.9647    34.8207
SAME         1.4842     0.9604
Ours (RVQ)   1.1238     0.9669
Ours (VAE)   0.2122     0.1974

Text-to-Motion on AT-HumanML3D

Method       FID     MMD     Top-3
Upper bound (ground truth)
GT           0.000   2.901   0.808
Text-to-motion generation
SAME         1.628   4.661   0.554
Ours (VAE)   1.226   4.576   0.563
Ours (RVQ)   0.158   3.955   0.639