V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents

1Shanghai Jiao Tong University 2OpenGVLab, Shanghai AI Laboratory 3Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences 4University of Chinese Academy of Sciences
*Corresponding Author

Abstract

Despite recent advances in video stylization, most existing methods struggle to render videos with complex transitions from open-ended user style descriptions. To address this, we introduce V-Stylist, a generic multi-agent system for video stylization built on a novel collaboration-and-reflection paradigm of multi-modal large language models (MLLMs). Specifically, V-Stylist is a systematic workflow with three key roles: (1) the Video Parser decomposes the input video into shots and generates a text prompt for the key content of each shot, enabling effective handling of complex transitions; (2) the Style Parser identifies the style in the user query and searches a style tree for a matching model, resolving vague style preferences through a robust tree-of-thought approach; (3) the Style Artist renders each shot in the desired style, using a multi-round self-reflection paradigm to adaptively adjust detail control. By mimicking how human professionals work, this design achieves a major breakthrough in effective and automatic video stylization. Moreover, we construct a new benchmark, the Text-driven Video Stylization Benchmark (TVSBench), to evaluate complex video stylization on open-ended user queries. Extensive experiments show that V-Stylist achieves state-of-the-art performance, surpassing FRESCO and ControlVideo by 6.05% and 4.51%, respectively, in overall average metrics, marking a significant advance in video stylization.
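To make the three-role workflow concrete, here is a minimal, hypothetical sketch of how the Video Parser, Style Parser, and Style Artist could hand off work to one another. All names and logic below (`parse_video`, `match_style`, `render_shot`, the toy style tree, the mocked quality check) are illustrative assumptions, not the authors' actual implementation; the real system delegates each step to an MLLM agent.

```python
# Hypothetical sketch of the V-Stylist three-role pipeline.
# All function names, the style tree, and the quality heuristic are
# assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class Shot:
    frames: list  # placeholder for decoded frames
    prompt: str   # text prompt describing the key shot content


def parse_video(video):
    """Video Parser: split the video into shots at transitions and
    caption each shot. Here a 'transition' is naively any change of
    scene label attached to a frame."""
    shots, current = [], []
    for frame, scene in video:
        if current and scene != current[-1][1]:
            shots.append(Shot([f for f, _ in current],
                              f"a scene of {current[-1][1]}"))
            current = []
        current.append((frame, scene))
    if current:
        shots.append(Shot([f for f, _ in current],
                          f"a scene of {current[-1][1]}"))
    return shots


# Toy style tree: coarse category -> {fine style keyword: model name}.
STYLE_TREE = {
    "painting": {"oil": "oil-paint-model", "ink": "ink-wash-model"},
    "cartoon": {"anime": "anime-model", "pixel": "pixel-art-model"},
}


def match_style(query):
    """Style Parser: walk the style tree from coarse categories to
    concrete models, picking the leaf whose keyword appears in the
    open-ended user query (a stand-in for tree-of-thought search)."""
    for _category, leaves in STYLE_TREE.items():
        for keyword, model in leaves.items():
            if keyword in query.lower():
                return model
    return "default-style-model"


def render_shot(shot, model, max_rounds=3):
    """Style Artist: render a shot, then self-reflect, adjusting a
    detail-control knob each round until a (mocked) quality check
    passes or the round budget runs out."""
    detail = 0.5
    result = {"prompt": shot.prompt, "model": model, "detail": detail}
    for _ in range(max_rounds):
        result = {"prompt": shot.prompt, "model": model, "detail": detail}
        if detail >= 0.7:  # stand-in for an MLLM quality judgment
            break
        detail += 0.2      # reflection: raise detail control and retry
    return result


# End-to-end: parse shots, resolve the style, stylize each shot.
video = [("f0", "city"), ("f1", "city"), ("f2", "forest")]
shots = parse_video(video)
model = match_style("make it look like an anime")
outputs = [render_shot(s, model) for s in shots]
```

In this sketch the two scene changes in `video` yield two shots, the query resolves to the `anime-model` leaf, and each shot is re-rendered until the detail threshold is met, mirroring the parse / match / reflect loop described above.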

What is V-Stylist?


Role of Style Parser


Role of Video Parser


Role of Style Artist


Visualization


BibTeX

@misc{yue2025vstylistvideostylizationcollaboration,
      title={V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents}, 
      author={Zhengrong Yue and Shaobin Zhuang and Kunchang Li and Yanbo Ding and Yali Wang},
      year={2025},
      eprint={2503.12077},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.12077}, 
}