Retrieving Conditions from Reference Images for Diffusion Models

Motivation

To generate high-quality stylized avatars, both identity and style must be preserved. Many existing methods do not disentangle the two, which yields images that are hard to control with prompts: the style of the input images leaks into the results, which in turn lack the style and diversity specified by the text prompt. We address this by training on multiple images of the same identity with a new architecture, as sketched below. Generation requires no additional training at inference time.
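
As an illustration only, here is a minimal sketch of how training data grouped by identity might be assembled. The folder layout, the class name, and the choice of four references per group are assumptions for exposition, not the authors' released code.

```python
# A hedged sketch of identity-grouped training batches; the dataset layout
# (one folder per identity) and k = 4 references are assumptions, not the
# authors' released pipeline.
import random
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class SameIdentityGroups(Dataset):
    """Yields k images of one identity per sample, so a model can learn
    identity features shared across the group while style varies."""

    def __init__(self, root: str, k: int = 4):
        self.k = k
        # Assumed layout: root/<identity_id>/<image>.jpg
        groups = [
            sorted(d.glob("*.jpg")) for d in Path(root).iterdir() if d.is_dir()
        ]
        self.identities = [paths for paths in groups if len(paths) >= k]

    def __len__(self):
        return len(self.identities)

    def __getitem__(self, idx):
        # Sample k distinct photos of the same person; in practice a
        # transform/collate step would convert these PIL images to tensors.
        paths = random.sample(self.identities[idx], self.k)
        return [Image.open(p).convert("RGB") for p in paths]
```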

Results on faces

Users can input up to four images of a person together with a text prompt to generate stylized avatar images instantly. The demo below loops through 149 sets of sample results: the first few dozen show the same prompt applied to different identities, followed by a variety of styles.
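
For readers who want to experiment with reference-conditioned generation along these lines, the sketch below uses the diffusers library's IP-Adapter support as a stand-in. It is not the method presented above; the model IDs, adapter scale, and prompt are illustrative assumptions.

```python
# A stand-in sketch using diffusers' IP-Adapter support (recent versions);
# this is NOT the architecture described on this page. Model IDs, the
# adapter scale, and the prompt are assumptions for illustration.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Attach an image-conditioning adapter so reference photos steer identity.
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
)
pipe.set_ip_adapter_scale(0.6)  # lower values let the text prompt set the style

# Up to four photos of the same person; the paths are placeholders.
refs = [Image.open(p).convert("RGB") for p in ("face_1.jpg", "face_2.jpg")]

result = pipe(
    prompt="watercolor avatar portrait, soft pastel palette",
    ip_adapter_image=[refs],  # one adapter, several reference images
    num_inference_steps=30,
).images[0]
result.save("avatar.png")
```

Raising the adapter scale pulls the output closer to the reference photos, while lowering it gives the text prompt more control over the final style; tuning that trade-off by hand is exactly the burden the disentangled training above aims to remove.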