LLaMo: Unified Motion Understanding and Generation

Abstract

[Figure: LLaMo teaser]

Recent progress in large models has driven significant advances in unified multimodal generation and understanding. However, models that unify motion-language generation and understanding remain largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can cause catastrophic forgetting of linguistic capabilities because the available text-motion pairs are limited in scale. Furthermore, prior methods typically quantize motion into discrete tokens to integrate it with language models, and this discrete tokenization introduces substantial jitter artifacts.

To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. We encode human motion into a causal continuous latent space and retain the next-token prediction paradigm of the decoder-only backbone through a lightweight flow-matching head, which allows streaming motion generation in real time (>30 FPS). Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, particularly in zero-shot motion generation, marking a significant step towards a general unified motion-language large model.
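As a rough illustration of how a flow-matching head can coexist with next-token prediction over continuous latents, the sketch below generates one latent per step by integrating a velocity field from Gaussian noise with Euler steps, conditioning each step only on previously generated latents. It is a toy under stated assumptions and not LLaMo's implementation: the names `toy_velocity`, `sample_latent`, and `generate_stream`, the latent size, the step count, and the conditioning rule `0.9 * prev + 0.1` are all hypothetical, and the velocity field is a closed-form stand-in for what would be a learned network.

```python
import numpy as np


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned flow-matching head.

    For the straight-line path x_t = (1 - t) * noise + t * target,
    the velocity that transports x_t to `target` is (target - x_t) / (1 - t).
    A real head would be a small network conditioned on backbone features.
    """
    return (target - x) / (1.0 - t)


def sample_latent(target, num_steps, rng):
    """Draw one continuous latent by Euler-integrating the velocity field."""
    x = rng.standard_normal(target.shape)  # start from Gaussian noise at t = 0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        x = x + dt * toy_velocity(x, t, target)
    return x


def generate_stream(num_frames, latent_dim=4, num_steps=8, seed=0):
    """Yield latents one at a time, each conditioned only on past latents."""
    rng = np.random.default_rng(seed)
    prev = np.zeros(latent_dim)
    for _ in range(num_frames):
        # Assumed toy "backbone": a fixed function of the previous latent.
        target = 0.9 * prev + 0.1
        latent = sample_latent(target, num_steps, rng)
        prev = latent
        yield latent  # a streaming consumer could decode this immediately


if __name__ == "__main__":
    for i, z in enumerate(generate_stream(3)):
        print(i, np.round(z, 4))
```

Because the toy velocity field follows a straight path, each sampled latent lands on its conditioning target up to floating-point rounding. The point of the sketch is the control flow: the outer loop stays causal and emits one continuous latent per step with no quantization, while the inner loop is the only place where the flow is integrated.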

Zero-shot Text-to-Motion Comparisons


"A strong athlete is lifting heavy weights in the gym."

LLaMo-8B (Ours)

MotionMillion-7B

"A zombie slowly dragging its feet forward, arms outstretched, letting out a low groan."

LLaMo-8B (Ours)

MotionMillion-7B

"Raise both hands."

LLaMo-8B (Ours)

MotionMillion-7B

"Cross arms."

LLaMo-8B (Ours)

MotionMillion-7B

"A middle-aged couple greeting each other warmly after a long day, wrapping their arms around each other in a tight hug."

LLaMo-8B (Ours)

MotionMillion-7B

"The emaciated woman sat on the floor, her hands wrapped around her knees as she trembled, her head buried in her arms, sobbing, her body curled up."

LLaMo-8B (Ours)

MotionMillion-7B

"A surfer is riding a big wave."

LLaMo-8B (Ours)

MotionMillion-7B

"A hunter is stalking prey in the forest."

LLaMo-8B (Ours)

MotionMillion-7B

"A confident performer in a flashy costume strikes a dramatic pose, then leaps into a high-flying cartwheel across the stage."

LLaMo-8B (Ours)

MotionMillion-7B

"A woman practicing yoga, gracefully transitioning from a downward dog position to a cobra pose."

LLaMo-8B (Ours)

MotionMillion-7B

Multilingual Generation

LLaMo inherits the multilingual capability of pretrained LLMs, enabling motion generation directly from non-English prompts without any multilingual training.

EN: "An obese middle-aged male security guard, walking and looking around."
中: 一个肥胖的中年男保安，边走边环顾四周。
हि: एक मोटा मध्यम आयु का पुरुष सुरक्षा गार्ड, चलते हुए चारों ओर देख रहा है।

English

Chinese (中文)

Hindi (हिन्दी)

EN: "A man of average build who looked lost was walking along the street when a giant pie suddenly hit his head. He clutched his head with both hands and squatted down."
中: 一个身材中等、看起来迷茫的男人正沿着街道行走，突然一个巨大的馅饼砸中了他的头部。他双手抱头，蹲了下来。
हि: एक साधारण कद-काठी वाला आदमी, खोया हुआ सड़क पर चल रहा था जब अचानक एक विशाल पाई उसके सिर पर जा लगी।

English

Chinese (中文)

Hindi (हिन्दी)

EN: "A strong man in deep sorrow rushed to the scene, running while reaching out and shouting with a voice filled with despair, his steps faltering."
中: 一个身强力壮的男人，悲伤万分地冲向现场，一边奔跑一边伸出手臂，绝望地呼喊着，脚步踉跄。
हि: एक गहरे दूख में डूबा हुआ ताकतवर आदमी अपना हाथ बढ़ा के और निराशा भरी आवाज़ से चिल्लाते हुए दौड़ के वहाँ पहुँचा।

English

Chinese (中文)

Hindi (हिन्दी)

EN: "A professional dancer is performing a ballet solo."
中: 一位专业舞蹈演员正在表演芭蕾独舞。
हि: एक पेशेवर नर्तकी बैले का एकल प्रदर्शन कर रही है।

English

Chinese (中文)

Hindi (हिन्दी)

Citation

@inproceedings{li2026llamo,
  title     = {LLaMo: Scaling Pretrained Language Models for Unified Motion
               Understanding and Generation with Continuous Autoregressive Tokens},
  author    = {Li, Zekun and An, Sizhe and Tang, Chengcheng and Guo, Chuan and
               Shugurov, Ivan and Zhang, Linguang and Zhao, Amy and
               Sridhar, Srinath and Tao, Lingling and Mittal, Abhay},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision
               and Pattern Recognition (CVPR)},
  year      = {2026}
}
