UnityVideo: Unified Multi-Modal Multi-Task Learning
for Enhancing World-Aware Video Generation

Jiehui Huang1,† Yuechen Zhang2 Xu He3 Yuan Gao4 Zhi Cen4
Bin Xia2 Yan Zhou4 Xin Tao4 Pengfei Wan4 Jiaya Jia1
1HKUST 2CUHK 3Tsinghua University 4Kling Team, Kuaishou Technology

Overall Method

UnityVideo Pipeline

Figure: Overview of the UnityVideo Framework


📹 Teaser Videos

JointGen

Sample 1 - View 1
Sample 1 - View 2
Sample 2 - View 1
Sample 2 - View 2
Sample 3 - View 1
Sample 3 - View 2
Sample 4 - View 1
Sample 4 - View 2

Estimator

Estimation 1
Estimation 2
Estimation 3
Estimation 4
Estimation 5
Estimation 6

ControGen

Control 1-1
Control 1-2
Control 2-1
Control 2-2
Control 3-1
Control 3-2
Control 4-1
Control 4-2

✨ Method Showcases

JointGen - Text to Video

T2A 1 - RGB
T2A 1 - Skeleton
T2A 2 - RGB
T2A 2 - Segmentation
T2A 3 - RGB
T2A 3 - Segmentation
T2A 4 - RGB
T2A 4 - RAFT

Estimator - Video to Modality

V2F 1 - RGB
V2F 1 - Skeleton
V2F 2 - RGB
V2F 2 - Skeleton
V2F 3 - Depth
V2F 3 - DensePose
V2F 4 - RGB
V2F 4 - RAFT

ControGen - Modality to Video

F2V 1 - Depth
F2V 1 - RGB
F2V 2 - RAFT
F2V 2 - RGB

🔍 Baseline Comparisons

Case 0 - Wan
Case 0 - UnityVideo
Case 1 - Hunyuan
Case 1 - UnityVideo
Case 2 - Hunyuan
Case 2 - UnityVideo
Case 3 - Hunyuan
Case 3 - UnityVideo
Case 4 - Hunyuan
Case 4 - UnityVideo
Case 5 - Hunyuan
Case 5 - UnityVideo
Case 6 - Hunyuan
Case 6 - UnityVideo
Case 7 - Hunyuan
Case 7 - UnityVideo
Case 8 - VACE
Case 8 - UnityVideo
Case 9 - VACE
Case 9 - UnityVideo