Despite recent advances in video stylization, most existing methods struggle to render videos with complex transitions based on open-ended user style descriptions. To address this, we introduce V-Stylist, a generic multi-agent system for video stylization that leverages a novel collaboration-and-reflection paradigm of multi-modal large language models. Specifically, V-Stylist is a systematic workflow with three key roles: (1) the Video Parser decomposes the input video into shots and generates text prompts describing key shot content, enabling effective handling of complex transitions; (2) the Style Parser identifies the style in the user query and searches for a matching model in a style tree, resolving vague style preferences via a robust tree-of-thought search; (3) the Style Artist renders each shot into the desired style, using a multi-round self-reflection paradigm to adaptively adjust detail control. This design, which mimics the workflow of human professionals, enables effective and automatic video stylization. Moreover, we construct a new benchmark, the Text-driven Video Stylization Benchmark (TVSBench), to evaluate complex video stylization under open user queries. Extensive experiments show that V-Stylist achieves state-of-the-art performance, surpassing FRESCO and ControlVideo by 6.05% and 4.51% respectively in overall average metrics, marking a significant advance in video stylization.
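The three-role workflow above can be sketched as a toy pipeline. Everything here is an illustrative assumption, not the paper's implementation: `parse_video` stands in for the Video Parser's shot decomposition, a nested dict stands in for the style tree searched by the Style Parser, and `render_shot` mimics the Style Artist's multi-round self-reflection by re-rendering with higher detail control until a quality check passes.

```python
# Hedged sketch of the V-Stylist workflow; all names, data structures,
# and scoring logic are illustrative assumptions, not the actual system.

def parse_video(frames, shot_boundaries):
    """Video Parser (sketch): split a frame list into shots at boundaries."""
    shots, start = [], 0
    for end in shot_boundaries + [len(frames)]:
        shots.append(frames[start:end])
        start = end
    return shots

# Toy style tree (assumption): category -> sub-style -> model name.
STYLE_TREE = {
    "painting": {"oil": "oil-model", "ink": "ink-model"},
    "cartoon": {"anime": "anime-model"},
}

def parse_style(query):
    """Style Parser (sketch): walk the style tree, descending into the
    branch whose key appears in the query -- a crude stand-in for the
    paper's tree-of-thought matching."""
    node = STYLE_TREE
    while isinstance(node, dict):
        match = next((k for k in node if k in query), None)
        if match is None:
            return None  # no style matched at this level
        node = node[match]
    return node

def render_shot(shot, model, quality_fn, max_rounds=3):
    """Style Artist (sketch): multi-round self-reflection -- re-render
    with adjusted detail control until the quality check passes."""
    detail = 0.5
    for _ in range(max_rounds):
        rendered = [(frame, model, round(detail, 2)) for frame in shot]
        if quality_fn(rendered):
            return rendered
        detail += 0.2  # reflect: raise detail control and retry
    return rendered

frames = list(range(6))
shots = parse_video(frames, shot_boundaries=[3])
model = parse_style("an ink painting look")
styled = [render_shot(s, model, lambda r: r[0][2] >= 0.7) for s in shots]
```

Here the "reflection" is a fixed quality threshold on a detail parameter; the real system would instead have an MLLM judge the rendered shot and adjust the rendering controls accordingly.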
@misc{yue2025vstylistvideostylizationcollaboration,
  title={V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents},
  author={Zhengrong Yue and Shaobin Zhuang and Kunchang Li and Yanbo Ding and Yali Wang},
  year={2025},
  eprint={2503.12077},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.12077},
}