Readout Guidance: Learning Control
from Diffusion Features

Grace Luo, Trevor Darrell, Oliver Wang, Dan B Goldman, Aleksander Holynski
Google Research, UC Berkeley

CVPR 2024
TL;DR: We train very small networks called "readout heads" to predict useful properties and guide image generation.

Abstract

We present Readout Guidance, a method for controlling text-to-image diffusion models with learned signals. Readout Guidance uses readout heads, lightweight networks trained to extract signals from the features of a pre-trained, frozen diffusion model at every timestep. These readouts can encode single-image properties, such as pose, depth, and edges; or higher-order properties that relate multiple images, such as correspondence and appearance similarity. Furthermore, by comparing the readout estimates to a user-defined target, and back-propagating the gradient through the readout head, these estimates can be used to guide the sampling process. Compared to prior methods for conditional generation, Readout Guidance requires significantly fewer added parameters and training samples, and offers a convenient and simple recipe for reproducing different forms of conditional control under a single framework, with a single architecture and sampling procedure. We showcase these benefits in the applications of drag-based manipulation, identity-consistent generation, and spatially aligned control.


Extracting Readouts

Given a frozen text-to-image diffusion model, we train parameter-efficient readout heads to interpret relevant signals, or readouts, from the intermediate network features. For Stable Diffusion XL, a readout head is at most 5.9M parameters, or 35MB. Training requires as few as 100 paired examples, since the heads build on top of pre-trained diffusion features. During sampling, a readout head can be used at any timestep to query the model's current estimate of a particular property for the image being generated.
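As a rough illustration, the sketch below shows what such a head could look like in PyTorch: small per-layer projections that fuse multi-resolution UNet features, a timestep-conditioned scale/shift, and a tiny decoder that produces a spatial readout such as depth. The shapes, layer choices, and names here are assumptions for illustration, not the exact architecture from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ReadoutHead(nn.Module):
        """Tiny head that fuses frozen UNet features into one spatial readout."""
        def __init__(self, feature_dims, out_channels=1, hidden=128, out_size=64):
            super().__init__()
            self.out_size = (out_size, out_size)
            # One 1x1 projection per feature source keeps the head lightweight.
            self.projs = nn.ModuleList(nn.Conv2d(d, hidden, kernel_size=1) for d in feature_dims)
            # Timestep conditioning as a simple scale/shift on the fused features.
            self.time_mlp = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, 2 * hidden))
            self.decoder = nn.Sequential(
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, out_channels, 3, padding=1),
            )

        def forward(self, features, t):
            # features: list of (B, C_i, H_i, W_i) activations from the frozen diffusion UNet.
            fused = 0
            for proj, f in zip(self.projs, features):
                fused = fused + F.interpolate(proj(f), size=self.out_size,
                                              mode="bilinear", align_corners=False)
            scale, shift = self.time_mlp(t.float().view(-1, 1)).chunk(2, dim=-1)
            fused = fused * (1 + scale[..., None, None]) + shift[..., None, None]
            return self.decoder(fused)

    # Example with three dummy feature maps at different resolutions.
    feats = [torch.randn(1, 320, 32, 32), torch.randn(1, 640, 16, 16), torch.randn(1, 1280, 8, 8)]
    head = ReadoutHead(feature_dims=[320, 640, 1280], out_channels=1)
    depth_readout = head(feats, t=torch.tensor([500]))   # -> (1, 1, 64, 64)

Training such a head is then ordinary supervised regression (or a similarity loss, for relative properties) against the paired labels, with the diffusion model kept frozen.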

Readout Guidance


These readouts can also be used for controlled image generation, by guiding the readout toward a desired target value. Readouts for single-image concepts, such as pose and depth, enable spatially aligned control. Readouts for relative concepts between two images, such as appearance similarity and correspondence, enable cross-image controls, such as drag-based manipulation, identity-consistent generation, and image variations.
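In pseudocode, one guidance step looks roughly like the following. The helper names are assumptions for illustration (a `unet` that returns both the noise prediction and its intermediate features, a trained `head`, a task-specific `loss_fn`), and the exact way the gradient enters the sampler may differ from this simple latent update.

    import torch

    def readout_guidance_step(unet, head, x_t, t, target, loss_fn, weight=1.0):
        """Steer the noisy latent x_t so the readout moves toward `target`."""
        x_t = x_t.detach().requires_grad_(True)
        eps, features = unet(x_t, t)              # frozen diffusion model; also exposes features
        readout = head(features, t)               # head's current estimate of the property
        loss = loss_fn(readout, target)           # distance to the user-defined target
        grad, = torch.autograd.grad(loss, x_t)    # gradient flows back through the readout head
        # Nudge the latent against the gradient (classifier-guidance style); the sampler
        # then proceeds with its usual denoising update using `eps`.
        return (x_t - weight * grad).detach(), eps.detach()

    # Toy invocation with stand-in components (a real setup uses a frozen
    # Stable Diffusion UNet and a trained readout head such as the one above).
    dummy_unet = lambda x, t: (torch.zeros_like(x), [x])
    dummy_head = lambda feats, t: feats[0].mean(dim=1, keepdim=True)
    x = torch.randn(1, 4, 64, 64)
    x, eps = readout_guidance_step(dummy_unet, dummy_head, x, torch.tensor([500]),
                                   target=torch.zeros(1, 1, 64, 64),
                                   loss_fn=torch.nn.functional.mse_loss, weight=10.0)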


Spatially Aligned Control

We can perform pose, depth, or edge-guided generation, similar to ControlNet.
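Under this framework, a spatially aligned control reduces to a spatial target plus a per-pixel loss plugged into the guidance step sketched above; the snippet below is a hedged illustration with assumed helper names.

    import torch.nn.functional as F

    def spatial_loss(readout, target_map):
        # Per-pixel L2 between the predicted readout (e.g., depth) and the control map.
        return F.mse_loss(readout, target_map)

    # Inside the sampling loop, at each timestep t:
    #   x_t, eps = readout_guidance_step(unet, depth_head, x_t, t,
    #                                    target=target_depth, loss_fn=spatial_loss, weight=w)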



Drag-Based Manipulation

We can enable drag-based manipulation of real and generated images.

Generated Images

Real Images

We can perform the same manipulations on real images without per-sample optimization, by guiding against features derived from inverting the real image.
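One way to picture this: invert the real image (e.g., with DDIM inversion), cache its correspondence-feature readouts along the trajectory, and then guide the edited sample so the feature at each drag destination matches the cached feature at the corresponding source point. The sketch below is a hedged illustration with assumed names and an assumed cosine-similarity comparison, not the paper's exact loss.

    import torch
    import torch.nn.functional as F

    def drag_loss(feat_gen, feat_ref, src_points, dst_points):
        """feat_gen / feat_ref: (B, C, H, W) correspondence-feature readouts for the
        current sample and the inverted real image; points are (row, col) handles."""
        loss = 0.0
        for (sr, sc), (dr, dc) in zip(src_points, dst_points):
            ref_vec = feat_ref[:, :, sr, sc].detach()   # feature at the handle in the real image
            gen_vec = feat_gen[:, :, dr, dc]            # feature at the drag target in the sample
            loss = loss + (1 - F.cosine_similarity(gen_vec, ref_vec, dim=1)).mean()
        return loss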

Identity Consistency

We can guide a generated image to contain a specific identity, defined by a reference image.
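The paper uses an appearance-similarity readout head for this; as a stand-in, the hedged sketch below uses pooled-feature cosine similarity to show the general shape of such an objective (names and the pooling choice are assumptions, not the paper's formulation).

    import torch
    import torch.nn.functional as F

    def identity_loss(feat_gen, feat_ref):
        # Global average-pool each (B, C, H, W) appearance readout to a descriptor,
        # then pull the sample's descriptor toward the reference's.
        g = F.normalize(feat_gen.mean(dim=(2, 3)), dim=1)
        r = F.normalize(feat_ref.mean(dim=(2, 3)), dim=1).detach()
        return (1 - (g * r).sum(dim=1)).mean()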



Image Variations

With the appearance similarity head, we can explore different amounts of image variation by changing the guidance weight.


Acknowledgements

We would like to thank Stephanie Fu, Eric Wallace, Yossi Gandelsman, Dave Epstein, Ben Poole, Thomas Iljic, Brent Yi, Kevin Black, Jessica Dai, Ethan Weber, Rundi Wu, Xiaojuan Wang, Luming Tang, Ruiqi Gao, Jason Baldridge, Zhengqi Li, and Angjoo Kanazawa for helpful discussions and feedback.

BibTeX


    @inproceedings{luo2024readoutguidance,
      title={Readout Guidance: Learning Control from Diffusion Features},
      author={Grace Luo and Trevor Darrell and Oliver Wang and Dan B Goldman and Aleksander Holynski},
      booktitle={CVPR},
      year={2024}
    }