Kubrick: Multimodal Agent Collaborations for Video Generation

¹Purdue University, ²Baidu USA

A short demo film crafted with our agent-based pipeline for generating controllable and physically plausible videos.

Abstract

We build the first multimodal agent-based video generation pipeline through 3D engine scripting. Given a text prompt, multimodal agents collaborate to produce detailed Blender scripts that generate videos of any length with plausible character and motion consistency.

Multi-agent Collaboration Framework

Our video generation pipeline consists of collaborations among multiple agents, together with multimodal reflection loops over both the generated visual results and the code libraries. The LLM Director agent takes the user query and transforms it into a detailed functional process. The LLM Programmer agent then composes the corresponding Blender Python scripts, given in-context function libraries for the set of basic processes. Each intermediate screenshot and video output is reviewed by the VLLM Reviewer agent, which evaluates its key features. The Reviewer's reflections are fed back to the Programmer agent for iterative improvement. In particular, the Reviewer agent and the function libraries are informed by retrieved public video tutorials and online documents, which speeds up the reflection loops.
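The control flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the actual implementation: the names `run_pipeline`, `Review`, `max_rounds`, and the `toy_*` stand-ins are all hypothetical, and the real agents would call an LLM, a vision-language model, and Blender rather than the string-based placeholders used here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    """Reviewer verdict on one rendered output (hypothetical structure)."""
    passed: bool
    feedback: str


def run_pipeline(
    prompt: str,
    director: Callable[[str], List[str]],
    programmer: Callable[[str, str], str],
    render: Callable[[str], str],
    reviewer: Callable[[str, str], Review],
    max_rounds: int = 3,
) -> List[str]:
    """Director plans steps; Programmer and Reviewer iterate on each step."""
    scripts = []
    for step in director(prompt):
        feedback = ""
        script = ""
        for _ in range(max_rounds):
            script = programmer(step, feedback)   # write or revise a script
            output = render(script)               # screenshot or video stand-in
            review = reviewer(step, output)       # evaluate key features
            if review.passed:
                break
            feedback = review.feedback            # feed reflection back
        scripts.append(script)
    return scripts


# Toy stand-ins so the loop can run without an LLM or Blender (assumptions).
def toy_director(prompt: str) -> List[str]:
    return [f"set up scene for: {prompt}", "animate character"]


def toy_programmer(step: str, feedback: str) -> str:
    suffix = f"  # revised: {feedback}" if feedback else ""
    return f"# script placeholder for '{step}'{suffix}"


def toy_render(script: str) -> str:
    return script  # pretend the rendered output is just the script text


def toy_reviewer(step: str, output: str) -> Review:
    if "revised" in output:
        return Review(passed=True, feedback="")
    return Review(passed=False, feedback="add lighting")
```

Calling `run_pipeline("a robot walking", toy_director, toy_programmer, toy_render, toy_reviewer)` with these stand-ins returns one script per Director step, each revised once after the first review fails. Capping the loop with `max_rounds` is a design choice of this sketch to guarantee termination.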

Illustration of the Framework

Framework image.

Demos

Pure Simulation with Input Prompt: Show Spider-Man jumping off a skyscraper in a metropolitan city at night, emphasizing his bold descent against the city lights.


Rendered with AnimateDiff.


Pure Simulation


Pure Simulation with Input Prompt: Ironman blocks bullets, runs, boxing hooks. Man in black uniform gets knocked over. Ironman flies away.

The generated videos are easily customizable for different characters.

The generated videos are also easily controllable, for instance through the camera angle.


Camera Control Input Prompt: Zooming out, man looking like Elon Musk, cheering and waving his arms, rocket, cybertruck behind.


Camera Control Input Prompt: Fast zooming in, man looking like Elon Musk, cheering and waving his arms, rocket, cybertruck behind.
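The two camera-control prompts above differ only in zoom direction and speed. As a rough illustration of how such a request might become scripted camera motion, the sketch below computes camera positions along the line between the camera and a target. It assumes the zoom is realized by moving the camera (a dolly) rather than by changing focal length. The function name `dolly_keyframes` and its parameters are hypothetical and not taken from the pipeline.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def dolly_keyframes(
    start: Vec3, target: Vec3, zoom: float, n_frames: int
) -> List[Tuple[int, Vec3]]:
    """Return (frame, camera_position) pairs for a linear dolly move.

    The final camera-to-target distance is `zoom` times the initial one:
    zoom > 1 moves away (zoom out), 0 < zoom < 1 moves closer (zoom in).
    Fewer frames for the same `zoom` gives a faster move.
    """
    if n_frames < 2:
        raise ValueError("need at least two frames")
    if zoom <= 0:
        raise ValueError("zoom must be positive")
    keys = []
    for i in range(n_frames):
        t = i / (n_frames - 1)               # progress from 0.0 to 1.0
        scale = 1.0 + t * (zoom - 1.0)       # distance multiplier at this frame
        pos = tuple(tc + scale * (sc - tc) for sc, tc in zip(start, target))
        keys.append((i + 1, pos))            # frames numbered from 1
    return keys
```

Inside Blender, each pair could then be applied to a camera object by setting its `location` and calling `keyframe_insert(data_path="location", frame=frame)`. Whether the pipeline's generated scripts take this form is not stated on this page.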


Anime character dva relaxing in her bedroom, sitting down to snack and play video games.


A wounded female knight faltering and dramatically collapsing in a poisonous forest


A man in business casual and a blonde princess dancing on the stage, facing each other, camera flys close to the blonde princess.