Style Transfer
Recently, I've been having a lot of fun playing with this TensorFlow tutorial on style transfer by Magnus Erik Hvass Pedersen, based on the paper Image Style Transfer Using Convolutional Neural Networks by Gatys, Ecker, and Bethge. The basic goal in style transfer is to take an image and redraw it in the artistic style of a separate image. The final result should match the content of the first image and the style of the second.
For example, using the tutorial to combine a photograph of Delicate Arch with the style of Bruegel's Jagers in de Sneeuw, I obtained the following image:
Gatys, Ecker, and Bethge suggested using a deep convolutional neural network to define both the content and the style of an image. The convolutional filters identify different features of an image: colors, corners, and edges in the early layers, and more complicated features built from these in the later layers. Gatys, Ecker, and Bethge's idea was that these feature activations can be used to capture both content and style.
The main goal of this post is to work towards a better understanding of what this algorithm is doing, especially what is meant by style and how it is captured. We'll approach this from a few different angles. In particular, we'll generate a few "pure style" images starting from photographs with strong stylistic features in order to get an idea of what does and what does not count towards style. All of the generated style images, along with the images they were based upon, can be found in the blog materials Github repository (along with a few examples that didn't make it into this blog post).
If you want to play with style transfer yourself but don't want to work directly with neural networks, you can make use of the algorithm in black-box fashion with DeepArt.
What is Content?
Capturing content is fairly straightforward: if you can generate an image having the same feature activations as the original content image — that is, the images display the same shapes in the same locations — your two images should have similar content. Of course, if we match all of the feature activations perfectly, we risk reconstructing the content image too closely, thus defeating the purpose of style transfer. So we need to decide which layers of the convolutional network to use for content.
One way to get an idea of what structure is captured by each layer of the network is to perform style transfer without a style image. That is, start with a white noise image and try to reconstruct our content image using only specific layers.
A perfect reconstruction would take the image on the left as input and produce the image on the right as output.
We're working with the VGG-16 convolutional neural network described in this paper. The network contains 13 convolutional layers using small (3x3) convolutional filters. Let's see what we get when we use different layers in the network for the content:
Matching the features at the first layer reproduces the image almost perfectly. By the thirteenth layer, the arch is almost unrecognizable. As we move to higher layers, the colors match less and less and the edges become less sharp.
Note that, when working with the earlier layers, we're not explicitly matching the complex features. Instead we get them for free by matching the smaller building blocks: if you know that you have two eyes sitting above a nose and a mouth, it's quite likely that you have a face.
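If you want to see the idea in code, here is a minimal sketch of this kind of reconstruction. It assumes TensorFlow 2.x with eager execution, and extract_features is a hypothetical stand-in for a function that runs an image through VGG-16 and returns the activations of the requested layers; none of this is the tutorial's exact code.

import tensorflow as tf

def content_loss(content_features, generated_features):
    # Mean squared difference of the feature activations at each chosen layer.
    return tf.add_n([tf.reduce_mean(tf.square(g - c))
                     for c, g in zip(content_features, generated_features)])

def reconstruct_content(content_image, extract_features, layer_ids, steps=1000):
    # Activations of the content image at the chosen layers (held fixed).
    content_features = extract_features(content_image, layer_ids)
    # Start from white noise and optimize the pixels directly.
    generated = tf.Variable(tf.random.uniform(tf.shape(content_image)))
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.02)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = content_loss(content_features,
                                extract_features(generated, layer_ids))
        grads = tape.gradient(loss, generated)
        optimizer.apply_gradients([(grads, generated)])
    return generated

Note that only the pixels of generated are being optimized; the network weights stay fixed throughout.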
What is Style?
It might be tempting to try to capture style in exactly the same way we did for content: match the features in the lower levels of the network to the style image. But we saw that matching the features in the low levels automatically matches the features in the high levels as well, and we don't want to overwrite the content information. Furthermore, the feature activations themselves carry information about location, and it's reasonable to expect style to be a global property of the image. Gatys, Ecker, and Bethge's idea was not to match the feature activations directly, but instead to match some of their statistical properties. The tool they use for this is the Gram matrix.
So what stylistic elements are captured by Gram matrices?
We're going to explore this question both quantitatively and qualitatively. We'll begin with a highly simplified example — 2x2 black and white images — and work directly with Gram matrices to understand how they relate to style. Later on we'll use style transfer to generate images with style and no content. By comparing the original style images to the generated images, we can get a feeling as to what is counted as style when using the VGG-16 network.
Put another way, we'll start with some math and then celebrate with some fun pictures.
Gram Matrix Definition
The Gram matrix of a set of vectors \(v_1, v_2, \ldots, v_n\) is the matrix whose \((i,j)\)-th entry consists of the dot product of \(v_i\) with \(v_j\). If we let \(V\) denote the matrix whose \(j\)-th column is \(v_j\), so \(V = [v_1 \; v_2 \; \cdots \; v_n]\), then the Gram matrix of these vectors can be represented as the product \(V^T V\).
The Gram matrix is related to the covariance matrix: if the vectors are centered random variables (each entry of a fixed vector is drawn from a common distribution with mean 0), then the Gram matrix is proportional to the sample covariance matrix. If the vectors are not centered, then the Gram matrix seems to capture a mix of information about both the means and the covariances.
In both cases, the Gram matrix can be thought of as delocalizing the feature activation information. The Gram matrix captures some information about the feature activations, but it doesn't know in which part of the image these features are activated.
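As a small concrete illustration (the 4x2 activation matrix below is made up), here is the Gram matrix as \(V^T V\) in NumPy, along with its centered counterpart and the covariance connection:

import numpy as np

# Made-up feature activations: one column per feature channel (v_1 and v_2),
# one row per spatial position.
V = np.array([[1.0, 0.2],
              [0.0, 0.8],
              [0.0, 0.5],
              [0.0, 0.1]])

gram = V.T @ V                    # (i, j) entry is the dot product of v_i and v_j

V_centered = V - V.mean(axis=0)   # subtract each channel's mean activation
gram_centered = V_centered.T @ V_centered

# With centered columns, the Gram matrix is proportional to the sample
# covariance matrix (np.cov divides by n - 1 = 3 here).
print(np.allclose(gram_centered / 3, np.cov(V, rowvar=False)))  # True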
A Spherical Cow
Let's work with a toy example. Like all spherical cows, it has its limitations, but is a good place to start. Suppose we are working with 2x2 black and white images. We'll represent the intensity of each pixel on a scale from 0 (black) to 1 (white). We'll apply to each image a single 1x1 convolutional filter corresponding to the identity: it takes the pixel intensity as input and returns the same value as the output.
For example, if we take the sample image on the right, the feature activations form the vector \(v = [1, 0, 0, 0]\). In this case, the Gram matrix has a single entry which is the dot product of this vector with itself:
\[ v \cdot v = 1^2 + 0^2 + 0^2 + 0^2 = 1. \]
Images with the same Gram matrix can be represented by matrices \(A = (a_{ij})\) satisfying
\[ a_{11}^2 + a_{12}^2 + a_{21}^2 + a_{22}^2 = 1. \]
Here are some examples of images satisfying this property:
Both the mean and the variance of the pixel intensities vary from one image to the next with a sort of trade-off: as the mean intensity increases, the variance decreases.
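Here is a quick check in Python. The uniform gray image is my own made-up example, chosen to satisfy the same constraint as the single white pixel:

import numpy as np

def gram_value(pixels):
    # With our single 1x1 identity filter, the Gram matrix has one entry:
    # the dot product of the activation vector with itself.
    v = np.asarray(pixels, dtype=float)
    return v @ v

one_white_pixel = [1.0, 0.0, 0.0, 0.0]   # low mean, high variance
uniform_gray    = [0.5, 0.5, 0.5, 0.5]   # higher mean, zero variance

print(gram_value(one_white_pixel))  # 1.0
print(gram_value(uniform_gray))     # 1.0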
Covariance matrices are more familiar objects and more easily interpretable, so let's see what we get if we center the activations before computing the Gram matrix. The mean value of the coordinates of \(v = [1, 0, 0, 0]\) is \(1/4\). Subtracting, we get the vector \(w = [3/4, -1/4, -1/4, -1/4]\) and the dot product of \(w\) with itself is
\[ w \cdot w = \left(\frac{3}{4}\right)^2 + 3\left(\frac{1}{4}\right)^2 = \frac{12}{16} = \frac{3}{4}. \]
For lack of a better term, we'll call this the centered Gram matrix.
Each image with the same centered Gram matrix can be represented by a matrix of pixel intensities \(A\) satisfying
\[ \sum_{i,j} (a_{ij} - \bar{a})^2 = \frac{3}{4}, \]
where \(\bar{a}\) is the mean of the entries:
\[ \bar{a} = \frac{1}{4}\left(a_{11} + a_{12} + a_{21} + a_{22}\right). \]
Here are some examples of images with this property:
The pixel intensities of these images all have the same fixed variance of 3/4. However, the mean intensities do vary from one image to the next. So, for example, one could add or subtract 0.05 to each of the intensities in the third matrix without altering the centered Gram matrix.
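We can check this numerically. The second image below is a made-up example (not necessarily one of the pictured matrices), chosen so that shifting every pixel by 0.05 keeps the intensities in \([0, 1]\):

import numpy as np

def centered_gram_value(pixels):
    # Center the activations, then take the dot product with itself.
    w = np.asarray(pixels, dtype=float) - np.mean(pixels)
    return w @ w

original = np.array([1.0, 0.0, 0.0, 0.0])
# A made-up image with the same centered Gram value: two pixels at
# 0.5 + sqrt(3)/4 and two at 0.5 - sqrt(3)/4.
example = np.array([0.5 + np.sqrt(3) / 4] * 2 + [0.5 - np.sqrt(3) / 4] * 2)

print(centered_gram_value(original))        # 0.75
print(centered_gram_value(example))         # 0.75
print(centered_gram_value(example + 0.05))  # 0.75, shifting the mean changes nothing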
The Gram matrix and centered Gram matrix give different results. Which represents style better? It's difficult to say, especially with our constrained example. What happens if we impose both constraints? That is, what do we get if we require both the Gram matrix and the centered Gram matrix to match those of the original image?
It turns out this is the same as fixing both the variance of the activations (and thus of the pixel intensities) and the activation means. We already know that the constraint on the centered Gram matrix fixes the variance. Recall that
\[ \sum_{i,j} (a_{ij} - \bar{a})^2 = \sum_{i,j} a_{ij}^2 - 4\bar{a}^2. \]
Using our constraints on the Gram matrix and centered Gram matrix, this reduces to
\[ \frac{3}{4} = 1 - 4\bar{a}^2, \quad \text{so} \quad \bar{a}^2 = \frac{1}{16}. \]
Because each of the \(a_{ij}\) is nonnegative, it follows that \(\bar{a} = 1/4\) or, equivalently, that the sum of the four activations is \(1\). It turns out that there are exactly four images satisfying these properties. To see this, consider our constraints:
\[ \sum_{i,j} a_{ij}^2 = 1 \quad \text{and} \quad \sum_{i,j} a_{ij} = 1, \]
and note that
\[ 1 = \Big(\sum_{i,j} a_{ij}\Big)^2 = \sum_{i,j} a_{ij}^2 + 2 \sum_{(i,j) \neq (k,\ell)} a_{ij} a_{k\ell} = 1 + 2 \sum_{(i,j) \neq (k,\ell)} a_{ij} a_{k\ell}. \]
Since \(0 \leq a_{ij} \leq 1\), it follows that each of the cross terms \(a_{ij} a_{k\ell}\) is zero and, therefore, that at most one of the \(a_{ij}\) is nonzero. Combined with the constraint that the activations sum to \(1\), the single nonzero entry must equal \(1\). The four images satisfying these constraints are given below:
To me, these appear to have the same style.
Further Exploration
I hope the example above was illuminating. We should be wary of drawing overly strong conclusions from such a simple example, but it was a manageable place to start and it suggests some directions for further exploration. What if we work with 4x4 images and 2x2 filters? What if we use a different parametrization with black corresponding to -1 and use rectified linear activations? How should we measure the distance between styles when the Gram matrix is larger than 1x1?
We won't pursue any of these ideas any further here. Instead I'll take the easy way out and leave their exploration as an exercise for the interested reader.
Visualizing Style
In the next part, we're going to try to understand style by using style transfer. Earlier, we tried generating Delicate Arch using a content image and no style. Now, we're going to generate images using only style and no content. The naming convention for the original images is a little haphazard (I didn't realize how far I would take this when I first got started and didn't feel like renaming the files afterwards), but it matches the file names in the Github repository.
The More Rigid Style
To capture style, instead of using the Gram matrix by itself, we'll use both the Gram matrix and the centered Gram matrix. Despite the discussion above, this decision was based more on trial and error than on theory. While playing around with style transfer and trying to understand what was going on, I replaced the Gram matrix with the centered Gram matrix. Often this improved the patterns in the generated image but messed up the colors. After experimenting a little, I decided to incorporate both matrices into my algorithm, and suddenly the styles I was generating seemed a lot more accurate. I'll try to demonstrate this with some examples below.
(An aside: When performing style transfer, you may be less interested in generating the most stylistically accurate image than in generating an aesthetically pleasing one. In that case, I would recommend experimenting a bit with how you choose to measure style.)
Let's begin with a simple example: what is the style of a dark circle on a light background? In the figure below, the image on the left was used as the style image. The three other images were generated to match its style using just the regular Gram matrix, just the centered Gram matrix, and both Gram matrices, respectively.
The image generated using the Gram matrix is mostly gray with a few concentrations of light or dark points. The image generated using the centered Gram matrix does a better job of achieving the light and dark colors, but the balance is off: too many dark regions, not enough light regions. Using both matrices, we get a nice compromise: a good balance between light and dark, curved rather than straight edges, and mostly contiguous blobs.
Let's look at some more examples.
In some cases, the regular Gram matrix outperformed the centered Gram matrix. The colors are quite a bit off in this muddy style generated using the centered matrix:
In other cases, the centered Gram matrix outperformed the regular Gram matrix. The regular Gram matrix failed to capture the purple color of some of these peppers:
And sometimes each captured different aspects of the style.
But in all cases, at least to my eye, the images generated using a combination of both the regular Gram matrix and the centered Gram matrix most accurately captured the style of the original image they were based on.
To incorporate both Gram matrices, I modified the gram_matrix function in the tutorial. The modified function I used is presented below:
def gram_matrix(tensor):
    shape = tensor.get_shape()

    # Get the number of feature channels for the input tensor,
    # which is assumed to be from a convolutional layer with 4-dim.
    num_channels = int(shape[3])

    # Reshape the tensor so it is a 2-dim matrix. This essentially
    # flattens the contents of each feature-channel.
    matrix = tf.reshape(tensor, shape=[-1, num_channels])

    # Subtract the mean activation value from each feature so that
    # our Gram matrix is proportional to the covariance matrix.
    means = tf.reduce_mean(matrix, axis=0, keepdims=True)
    centered_matrix = matrix - means

    # Calculate the Gram-matrix as the matrix-product of
    # the 2-dim matrix with itself. This calculates the
    # dot-products of all combinations of the feature-channels.
    gram = tf.matmul(tf.transpose(matrix), matrix)

    # Do the same with the centered matrix.
    gram_centered = tf.matmul(tf.transpose(centered_matrix), centered_matrix)

    # Return both Gram matrices.
    return tf.concat([gram, gram_centered], axis=0)
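Because the two matrices are simply stacked into a single tensor, the rest of the tutorial's style-loss code, which takes mean squared differences of whatever gram_matrix returns, should work without further changes. As a quick sanity check on the output shape (with tensorflow imported as tf, and assuming TensorFlow 2.x eager execution; the tutorial itself builds a TensorFlow 1.x graph, where you'd evaluate the result in a session instead):

layer_activations = tf.random.uniform(shape=[1, 32, 32, 64])  # fake output of a conv layer
both_grams = gram_matrix(layer_activations)
print(both_grams.shape)  # (128, 64): the regular Gram matrix stacked on the centered one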
Different Style Layers
Just like content, the style of an image is different depending on which layers are considered. Let's start by working with the first layer of the convolutional network and add layers one by one.
Using only the first layer seems to produce the right colors, but not much else. Adding the second layer, larger patches of matching colors begin to appear. When the third layer is added, little flame-like shapes begin to appear and these get larger and more leaf-like when the fourth and fifth layers are added into the mix. After the fifth or sixth layer, things don't seem to change much. This might be an optimization problem — maybe if we used a more sophisticated optimization technique and let our algorithm run longer, we'd get a different result — or it could just be that the effect of the higher layers is too subtle to notice.
Since we tend to associate style with low-level features (e.g. colors and edges), it makes sense to set an upper bound on which layers to use for style and include all of the layers below that. The style images generated in the previous section and those that we'll look at in the next section used all 13 layers of the network when considering style. Just to get an idea about the stylistic contribution of each individual layer, let's see what happens if we work with the layers one at a time.
As we work with higher and higher layers, the patterns become more and more complex while the colors match those of the original image less and less.
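For reference, these experiments amount to varying the list of style-layer indices. Here is a rough sketch: the name style_layer_ids mirrors the tutorial's parameter as I remember it, and generate_style_image is a hypothetical wrapper around the optimization loop with no content image, so treat both as assumptions rather than the tutorial's actual API.

# The style images in this post used all 13 convolutional layers.
style_layer_ids = list(range(13))

def sweep_single_style_layers(style_image, generate_style_image, num_layers=13):
    # generate_style_image is a hypothetical stand-in for whatever wraps the
    # optimization loop with no content image; it is not part of the tutorial's
    # code. Passing [i] isolates the stylistic contribution of layer i.
    return [generate_style_image(style_image, style_layer_ids=[i])
            for i in range(num_layers)]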
Fashion Show
Now that we've decided how we're measuring style (using both the Gram matrix and centered Gram matrix for the feature activations from all 13 convolutional layers), we can try to understand style by looking at examples.
Contrived
Let's begin with some simple, contrived examples.
The halves style image is mostly separated into large light and dark patches with long straight edges on the boundaries. With the checkers style, we get sharp corners, lots of straight vertical and horizontal edges, and a good balance of red and black. With the rainbow circles style, most of the edges are curved and the ordering of the colors is preserved (e.g. red patches are nested between orange and purple). The generated lorem ipsum image doesn't contain any identifiable characters and has lost the neat line spacing, but it still displays many text-like properties.
Mostly Style
One way to understand style is to identify images that seem to be mostly style. Below are some examples where the generated style images closely resemble the images they're based on.
We did lose a little bit of content (e.g. the crack between the barnacle-covered rocks), but for the most part, the generated images could fill in for the originals if one doesn't look too closely. I think the images generated using the style of mud 4 and rock 1 are particularly convincing.
Not Just Style
Conversely, we can look at images that have content in addition to style. You'd be unlikely to confuse the original images below with their associated generated style images.
Indeed, given only the generated images (those on the bottom row), you'd likely have a hard time determining what the original images (top row) might have been. It probably comes as no surprise that the photograph of the bighorn sheep is more than just style. Nevertheless, it's good to confirm our definition of style isn't too broad.
Flowers
Images of flowers provide a good case study of the balance between style and content. Even though they display many strong stylistic elements, images of individual flowers tend to contain more content (the arrangement of the petals and leaves, the position of the flower relative to the background), while images containing many flowers of the same type together (flower 5 above and flower 17 below) are closer to pure style.
The flower images also highlight a common pattern of our algorithm: often the generated image pushes the bright color of the flower to the border of the image (e.g. flowers 7, 8, 10, and 12).
Surprises
Though I couldn't perfectly predict the appearance of the generated style images in advance, I found most of the results unsurprising. There were a few, however, that caught me off guard.
It seems the forest 1 and redrock 2 images contained more content than I thought. I was expecting to see straight dark tree trunks in the generated forest canopy. And I hadn't considered how important the location of the shadows is in the red rock image. The lilypad 1 image, on the other hand, seems to be less content and more style than I initially expected. In particular, the flowers in the generated image seem quite plausible.
Rogues Gallery
Thanks for reading! I hope that these examples have shed some light on what elements are treated as style for the purposes of style transfer (at least for the VGG-16 network). I had a lot of fun generating these images and I'll leave you with a few of my personal favorites: