
🎙️ InterActHuman: Audio-Driven Multi-Subject Video Generation with Precision Mask Guidance

🧠 Introduction

Overview:
InterActHuman is a diffusion transformer (DiT)-based framework for multi-concept, audio-driven human video generation. It overcomes the single-entity limitation of earlier approaches by localizing and aligning multi-modal inputs for each distinct subject: an iterative, in-network mask predictor infers fine-grained, spatio-temporal layouts for each identity, so that local cues, such as audio for accurate lip synchronization, are injected precisely into that identity's region during video synthesis.

Introduction to InterActHuman: Multi‑Concept Human Animation with Layout‑Aligned Audio Conditions

InterActHuman is a diffusion-transformer (DiT) framework designed for multi-subject, audio-driven human video generation. Unlike conventional models that fuse all inputs globally and support only single-entity animation, InterActHuman introduces an iterative mask predictor that learns spatio-temporal layouts for each identity, enabling precise injection of per-person audio, such as lip-sync cues, into their specific regions during the synthesis process.

Built on a DiT backbone with a multi-step denoising process, the model uses refined masks from previous steps to guide where audio and image conditions are injected in subsequent ones. This architecture enables accurate, synchronized animations for multiple individuals, including human-human and human-object interactions.
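
To make this control flow concrete, the sketch below mimics an iterative, mask-guided denoising loop in which masks predicted at one step gate where per-identity audio is injected at the next. All module names (`dit_denoise`, `predict_masks`, `inject_audio`), shapes, and update rules are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (assumption, not the released code): iterative, mask-guided
# denoising where masks predicted at one step gate audio injection at the next.
import torch

T_STEPS = 4      # illustrative number of denoising steps
N_IDS   = 2      # number of identities (e.g., two speakers)
TOKENS  = 256    # video latent tokens in a frame chunk
DIM     = 64     # latent channel dimension

def dit_denoise(latents, cond):
    """Stand-in for one DiT denoising step on conditioned latents."""
    return latents - 0.1 * (latents - cond)            # toy update rule

def predict_masks(latents):
    """Stand-in mask predictor: soft per-identity masks over latent tokens."""
    logits = torch.randn(N_IDS, TOKENS)                 # a small head in the real model
    return torch.softmax(logits, dim=0)                 # (N_IDS, TOKENS)

def inject_audio(latents, audio_feats, masks):
    """Add each identity's audio feature only inside its predicted mask."""
    cond = latents.clone()
    for i in range(N_IDS):
        cond = cond + masks[i].unsqueeze(-1) * audio_feats[i]
    return cond

latents     = torch.randn(TOKENS, DIM)                  # noisy video latents
audio_feats = torch.randn(N_IDS, DIM)                   # e.g., pooled wav2vec features
masks       = torch.full((N_IDS, TOKENS), 1.0 / N_IDS)  # uniform layout at step 0

for _ in range(T_STEPS):
    cond    = inject_audio(latents, audio_feats, masks) # use masks from the previous step
    latents = dit_denoise(latents, cond)
    masks   = predict_masks(latents)                    # refine layouts for the next step
```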

Key capabilities include:

  • Multi-Identity Animation: Accepts multiple reference images together with per-person audio streams, enabling multi-person dialogue videos.
  • Audio-Visual Coherence: Audio is injected locally within predicted masks, ensuring precise per-identity lip-sync without global blending.
  • Scalable Training Pipeline: Trained on over 2.6 million annotated video-entity pairs, the method generalizes well across diverse scenarios.

By effectively disentangling and aligning per-identity audio and visual cues, InterActHuman sets a new standard in audio-conditioned multi-person video synthesis, outperforming prior approaches in lip-sync accuracy, subject consistency, and overall quality.

🛠️ Methodology

Key Components:

  • Iterative Mask Prediction:
    Utilizes a mask predictor to infer spatio-temporal layouts for each identity, ensuring accurate localization of audio and visual cues.
  • Audio-Visual Alignment:
    Injects local audio conditions into their corresponding regions, ensuring layout-aligned modality matching in an iterative manner (a simplified sketch follows this list).
  • Diffusion Transformer Backbone:
    Employs a DiT backbone with a multi-step denoising process, dynamically refining masks to guide local condition injection.
  • Training Pipeline:
    Trained on a scalable data pipeline with over 2.6 million annotated video-entity pairs, supporting applications like audio-driven multi-person video generation and multi-concept video customization such as human-object interaction.
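
As a rough illustration of the layout-aligned injection referenced in the list above, the sketch below gates a plain cross-attention update between video tokens and per-identity audio features (e.g., wav2vec embeddings) with that identity's soft mask. The function, projections, and shapes are simplifying assumptions rather than the paper's MMDiT implementation.

```python
# Illustrative mask-gated audio cross-attention; names, projections, and shapes
# are simplifying assumptions, not the paper's MMDiT implementation.
import torch
import torch.nn.functional as F

def masked_audio_cross_attention(video_tokens, audio_tokens, mask, w_q, w_k, w_v):
    """
    video_tokens: (T, D) latent video tokens for one frame chunk
    audio_tokens: (A, D) audio features (e.g., wav2vec) for one identity
    mask:         (T,)   soft spatio-temporal mask for that identity, in [0, 1]
    """
    q = video_tokens @ w_q                                   # (T, D)
    k = audio_tokens @ w_k                                   # (A, D)
    v = audio_tokens @ w_v                                   # (A, D)
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (T, A)
    update = attn @ v                                        # (T, D)
    # Gate the residual update by the identity's mask so the audio cue only
    # affects that identity's region of the video latent.
    return video_tokens + mask.unsqueeze(-1) * update

T, A, D = 256, 32, 64
video = torch.randn(T, D)
audio = torch.randn(A, D)
mask  = torch.rand(T)                       # would come from the in-network mask predictor
w_q, w_k, w_v = (torch.randn(D, D) * D ** -0.5 for _ in range(3))
out = masked_audio_cross_attention(video, audio, mask, w_q, w_k, w_v)
```

In the full model, one such gated update would run per audio stream inside the DiT blocks, so each person's speech only drives their own region.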

📊 Benchmark Comparisons

Performance Metrics:

  • Lip-Sync Accuracy:
    Achieves state-of-the-art performance with a Sync-D score of approximately 6.67 (lower is better).
  • Motion Diversity:
    Demonstrates high motion diversity with an HKV score of around 59.64.
  • Video Quality:
    Outperforms baselines with a lower FVD score of approximately 22.88 (a brief Fréchet-distance sketch follows this list).
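
For context on the FVD number, the metric is a Fréchet distance between feature distributions of real and generated videos (typically I3D features). The snippet below is a generic sketch of that distance on pre-extracted features; the random arrays are placeholders, and this is not the paper's evaluation code.

```python
# Generic Fréchet-distance sketch on pre-extracted video features; FVD uses
# I3D features in practice, and the random arrays below are placeholders only.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # tiny imaginary parts from numerical noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

feats_real = np.random.randn(512, 64)     # placeholder features for real videos
feats_gen  = np.random.randn(512, 64)     # placeholder features for generated videos
print(f"Frechet distance: {frechet_distance(feats_real, feats_gen):.2f}")
```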

Ablation Studies:

  • Mask-Based Audio Injection:
    Confirms the effectiveness of the predicted dynamic-mask strategy for local audio injection, showing that global audio, ID-embedding, and fixed-mask baselines lead to worse synchronization and/or video quality (the variants are contrasted in the sketch below).
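
The ablated variants differ mainly in where the same audio feature is injected. The toy switch below contrasts them; the strategy names and the `fixed_mask` argument are illustrative assumptions, not identifiers from the paper.

```python
# Toy contrast of the ablated injection strategies (names are assumptions):
# only the injection site changes, the audio feature itself is identical.
import torch

def injection_mask(strategy, predicted_mask, fixed_mask):
    if strategy == "global":      # audio broadcast to every token (worse sync)
        return torch.ones_like(predicted_mask)
    if strategy == "fixed":       # static layout that cannot follow motion
        return fixed_mask
    if strategy == "dynamic":     # per-step predicted mask (the paper's choice)
        return predicted_mask
    raise ValueError(f"unknown strategy: {strategy}")

tokens = 256
predicted = torch.rand(tokens)                            # from the in-network mask predictor
fixed     = (torch.arange(tokens) < tokens // 2).float()  # e.g., left half of the frame
mask = injection_mask("dynamic", predicted, fixed)
```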

🧪 User Studies

Findings:

  • Lip Synchronization:
    User studies validate InterActHuman’s superiority in multi-person lip synchronization.
  • Subject Consistency:
    Demonstrates strong subject consistency across generated videos.
  • Visual Appearance Preservation:
    Preserves multi-concept visual appearances, outperforming several dedicated methods.

🔓 Open Source & Installation

Access:
The project is available from the InterActHuman GitHub repository.

Installation Guide:

  1. Clone the Repository:
     git clone https://github.com/InterActHuman/InterActHuman.git
     cd InterActHuman
  2. Set Up Python Environment:
     python3 -m venv venv
     source venv/bin/activate
  3. Install Dependencies:
     pip install -r requirements.txt
  4. Download Pre-trained Models:
     bash scripts/download_models.sh
  5. Run Example Workflow:
     python run.py --prompt "a conversation between two people" --output ./output_scene

Minimum Hardware & Software Requirements (a quick check script follows this list):

  • OS: Ubuntu 20.04+ / macOS 12+
  • CPU: Quad-core Intel/AMD or higher
  • GPU: NVIDIA RTX 3090 or A6000 (≥24 GB VRAM)
  • RAM: ≥32 GB
  • Storage: ≥100 GB SSD
  • Python: 3.9 – 3.11
  • CUDA: 11.6+ for GPU-accelerated training and inference
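
Before installing, an optional sanity check along these lines can confirm the environment meets the recommendations above; it assumes PyTorch is already installed, and the thresholds simply mirror the list.

```python
# Optional pre-install sanity check; assumes PyTorch is already available and
# mirrors the >=24 GB VRAM and Python 3.9-3.11 recommendations above.
import sys
import torch

assert (3, 9) <= sys.version_info[:2] <= (3, 11), "Python 3.9-3.11 recommended"

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024 ** 3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB, CUDA: {torch.version.cuda}")
    if vram_gb < 24:
        print("Warning: less than 24 GB VRAM; generation may not fit in memory.")
else:
    print("No CUDA GPU detected; GPU-accelerated inference is unavailable.")
```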

🔮 Future Work

Potential Enhancements:

  • Interactive Scene Editing:
    Introduce real-time control for users to modify masks, reposition agents, or update audio-injection layouts mid-generation.
  • Emotion and Expression Control:
    Incorporate emotion embeddings into the audio-attention module to enable expressive animations (e.g., happiness, sadness).
  • Long-Form and Conversational Flow:
    Extend support beyond short clips to sustained dialogue sequences with smooth topic transitions.
  • Cross-Domain Adaptation:
    Enable InterActHuman to animate different entity types (e.g., animals, robots) with the same multi-concept alignment strategy.
  • Lightweight & Edge Deployment:
    Explore sparse DiT variants or token pruning for on-device use and faster inference.

✅ Conclusion

Summary:
InterActHuman sets a new standard in audio-conditioned multi-person video synthesis by enabling fine-grained, multi-identity audio-guided animation. Its innovative approach to spatio-temporal mask alignment and local audio injection ensures high visual fidelity and synchronization across multiple subjects. With its open-source availability and robust performance metrics, InterActHuman paves the way for dynamic dialogue scenes, customizable videos, and richer human-centric media creation.

📚 References

  1. Wang, Z., Yang, J., Jiang, J., Liang, C., Lin, G., Zheng, Z., Yang, C., & Lin, D. (2025). InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions. arXiv preprint arXiv:2506.09984. Available at: https://arxiv.org/abs/2506.09984
  2. “Local audio conditioning … injecting wav2vec features via cross-attention after the MMDiT layer…” Benchmark and method analysis. Available at: themoonlight.io
  3. Method performance metrics (Sync-D, HKV, FVD) and ablation insights from the user study.
