Diffusion models have recently gained popularity for generating complex, high-dimensional outputs. While they are best known for creating AI art and hyper-realistic images, their applications extend to areas like drug design and continuous control. A diffusion model generates a sample by iteratively transforming random noise into it, and it is trained with a maximum likelihood objective, meaning it learns to match its training data as closely as possible. Most use cases, however, care less about matching the training data than about some downstream objective; in this post, we delve into how diffusion models can be trained directly on such objectives using reinforcement learning (RL).
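To make the iterative denoising procedure concrete, here is a minimal sketch in plain PyTorch. The `denoiser` model and the cumulative noise schedule `alphas_cumprod` are illustrative placeholders, and the update rule is a simplified DDPM-style step: the model's noise prediction is used to estimate the clean sample, which is then re-noised to the next (lower) noise level.

```python
import torch

@torch.no_grad()
def reverse_diffusion_sample(denoiser, alphas_cumprod, shape, device="cuda"):
    """Minimal sketch of iterative denoising: start from pure Gaussian noise
    and gradually transform it into a sample. `denoiser(x, t)` is assumed to
    predict the noise added at step t; `alphas_cumprod` is the cumulative
    noise schedule (both are hypothetical placeholders)."""
    T = len(alphas_cumprod)
    x = torch.randn(shape, device=device)                        # x_T ~ N(0, I)
    for t in reversed(range(T)):                                  # t = T-1, ..., 0
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = denoiser(x, t)                                      # predicted noise
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()        # estimate clean sample
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * noise  # step to level t-1
    return x                                                      # final sample x_0
```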
We present a technique to finetune Stable Diffusion with RL on a variety of objectives, including image compressibility, aesthetic quality, and prompt-image alignment. The last of these uses feedback from a large vision-language model as the reward, demonstrating how AI models can be used to improve one another without any humans in the loop.
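As an example of how simple such a reward can be, here is a sketch of a compressibility reward along the lines of what we use: the negative file size of the image after JPEG compression, so images that compress well score higher (negating it gives an incompressibility reward). The specific quality setting is illustrative.

```python
import io
from PIL import Image

def jpeg_compressibility_reward(image: Image.Image, quality: int = 95) -> float:
    """Reward = negative size (in kB) of the image after JPEG compression,
    so highly compressible images score higher. Negate the return value to
    get an incompressibility reward instead."""
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=quality)
    return -buffer.tell() / 1000.0  # compressed file size in kilobytes, negated
```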
Denoising Diffusion Policy Optimization
By framing the denoising process as a multi-step Markov decision process, in which each denoising step is an action and the reward arrives only once the final image is produced, our Denoising Diffusion Policy Optimization (DDPO) algorithm maximizes the reward using the entire sequence of denoising steps rather than only the final sample. We introduce two policy gradient variants: DDPO_SF, which uses the simple score-function (REINFORCE) estimator, and DDPO_IS, which uses an importance-sampled estimator that permits multiple optimization steps per round of data collection and performs best in practice.
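To give a sense of what the importance-sampled objective looks like in code, here is a sketch of a PPO-style clipped surrogate loss computed over the denoising steps of sampled trajectories. The tensor names, shapes, and clip range are assumptions for illustration, not the exact implementation.

```python
import torch

def ddpo_is_loss(new_logprobs, old_logprobs, advantages, clip_range=1e-4):
    """Sketch of an importance-sampled (DDPO_IS-style) policy gradient loss.
    new_logprobs / old_logprobs: log p(x_{t-1} | x_t, c) for each denoising
        step under the current and the data-collecting parameters,
        shape (batch, num_denoising_steps).
    advantages: normalized reward of the final image, broadcast to every
        denoising step, shape (batch, 1).
    The clipping acts as a trust region so that several gradient steps can
    be taken on the same batch of trajectories."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # maximize the clipped surrogate objective -> minimize its negation
    return -torch.minimum(unclipped, clipped).mean()
```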
Finetuning Stable Diffusion Using DDPO
We finetune Stable Diffusion v1-4 with DDPO_IS on four tasks defined by different reward functions: compressibility (the image's file size after JPEG compression, negated), incompressibility (the opposite), aesthetic quality (the score of the LAION aesthetics predictor), and prompt-image alignment (the similarity between the prompt and a vision-language model's description of the generated image). Guided by these rewards across a range of prompts, the finetuned models show clear improvements over vanilla Stable Diffusion.
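For the prompt-image alignment task, the reward comes from another AI model rather than a hand-written function: a vision-language model (LLaVA in our experiments) describes the generated image, and the reward is the semantic similarity between that description and the original prompt, measured with BERTScore. The sketch below assumes a hypothetical `describe_image` helper wrapping the VLM call, and the choice of the recall component should be treated as illustrative.

```python
from bert_score import score as bert_score

def prompt_alignment_reward(image, prompt, describe_image):
    """Sketch of a VLM-based prompt-image alignment reward.
    `describe_image` is a hypothetical callable that sends the image to a
    vision-language model (e.g. LLaVA) and returns a short text description."""
    caption = describe_image(image, question="What is happening in this image?")
    # BERTScore recall measures how well the caption covers the prompt.
    _, recall, _ = bert_score([caption], [prompt], lang="en")
    return recall.item()
```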
Unexpected Generalization
We observe surprising generalization in our finetuned models: the improvements extend well beyond the prompts seen during training. For instance, the aesthetic quality model was trained only on prompts naming a small set of common animals, yet it also produces more aesthetically pleasing images of animals it never saw during training, underscoring how well RL finetuning on these objectives transfers to new scenarios.
Overoptimization
While the benefits of RL finetuning are clear, it comes with a pitfall: overoptimization. Pushed far enough, the model stops improving in any meaningful sense and instead learns to exploit quirks of the reward function to drive up its score, a failure mode that is not yet well understood and warrants future research.
Conclusion
Our exploration of DDPO introduces a practical approach to training diffusion models with RL, taking them beyond pure pattern matching toward directly optimizing objectives we care about. By leveraging the pretrain-and-finetune paradigm, DDPO offers a way to adapt large diffusion models to a variety of domains, from image editing to protein synthesis.
To delve deeper into DDPO and its implementation, refer to our paper, website, and original code. If you want to apply DDPO to your own project, check out the PyTorch + LoRA implementation, which can finetune Stable Diffusion with a much smaller GPU memory footprint.
If you find DDPO relevant to your own research, please cite our paper.