Learning-Based Image Synthesis
WHEN? Spring 2022
WHO? Tomas Cabezon
WHY? 16-726 Image Synthesis
WHERE? Carnegie Mellon

This page shows the different projects developed in the class Image Synthesis with professor Jun-Yan Zhu. Just click on the following sections to expand each project:
In this project, we will explore the possibilities of image synthesis and editing using the GAN latent space. First, we will start by inverting a pre-trained generator to find a latent variable that closely reconstructs a given real image. In the second part of the assignment, we will interpolate between two images in the latent space, and we will finish with image editing: we will take a hand-drawn sketch and generate an image that fits the sketch accordingly, and then we will use these sketches to edit a given image.
This project is based on the following two articles: Generative Visual Manipulation on the Natural Image Manifold and Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?.
Inverting the Generator
For the first part of the assignment, we solve an optimization problem to reconstruct the image from a particular latent code. Natural images lie on a low-dimensional manifold, and we choose to consider the output manifold of a trained generator as close to the natural image manifold. So, we can set up the following nonconvex optimization problem. For some choice of loss \mathcal{L}, trained generator G, and a given real image x, we can write

z^* = \arg\min_{z} \mathcal{L}(G(z), x)
We choose a combination of pixel and perceptual loss, as the standard Lp losses do not work well for image synthesis tasks. We also tried the BCE loss, but it did not give good results. For the implementation of this part of the assignment we reuse what we learned in assignment 4, Neural Style Transfer. As this is a nonconvex optimization problem where we can access gradients, we attempt to solve it with a first-order or quasi-Newton optimization method (in our case, L-BFGS).
Perceptual and Pixel Loss: The content loss is a metric function that measures the content distance between two images at a certain individual layer. Denote the Lth-layer feature of the input image X as f^L_X and that of the target content image as f^L_C. The content loss is defined as the squared L2-distance of these two features:

\mathcal{L}_{content} = \lVert f^L_X - f^L_C \rVert_2^2
To extract the features, a VGG-19 net pre-trained on ImageNet is used. The pre-trained VGG-19 net consists of 5 blocks (conv1-conv5), with a total of 15 conv layers, and each block serves as a feature extractor at a different level of abstraction. As we saw in a previous assignment, the choice of layer has a big influence on the results. For this assignment, we have used the conv_5 layer, as it gives the best results.
For pixel loss, we implement the L1 loss over the pixel space.
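As a minimal sketch of how this optimization can be set up (the helper names G and vgg_features, and the default weights, are assumptions matching the description above, not the exact course code):

import torch
import torch.nn.functional as F

def invert(G, vgg_features, target, z_init, perc_weight=0.01, pix_weight=10.0, steps=250):
    """Optimize a latent code so that G(z) reconstructs the target image."""
    z = z_init.detach().clone().requires_grad_(True)
    optimizer = torch.optim.LBFGS([z])
    target_feat = vgg_features(target).detach()      # e.g. conv_5 activations of the real image

    def closure():
        optimizer.zero_grad()
        fake = G(z)
        pixel_loss = F.l1_loss(fake, target)                      # L1 over pixel space
        perc_loss = F.mse_loss(vgg_features(fake), target_feat)   # squared L2 on VGG features
        loss = pix_weight * pixel_loss + perc_weight * perc_loss
        loss.backward()
        return loss

    for _ in range(steps):                           # each L-BFGS step runs several closure calls
        optimizer.step(closure)
    return z.detach()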
Results
In this part of the assignment we try different possible combinations of VGG-19 layers, different perceptual and pixel loss weights, as well as different latent spaces (z, w, and w+). We have seen that the best results are obtained using the conv_5 layer, a weight of 0.01 for the perceptual loss, and 10.0 for the pixel loss. Nevertheless, we will use other weights for the following parts of the assignment. We also compare the outputs of a vanilla GAN and a StyleGAN and, as expected, the second outputs better results. We optimize the images for 1000 iterations, as more optimization time does not result in better output quality.





Interpolations
We use StyleGAN and the w+ space to embed two images into the latent space and output the images of their interpolation. In the following images and GIFs we can observe that the transitions are smooth and neat.
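A minimal sketch of the interpolation itself (assuming w1 and w2 are the embedded w+ codes of the two images and G decodes a w+ code into an image):

import torch

def interpolate(G, w1, w2, n_frames=30):
    """Linearly interpolate between two embedded w+ codes and decode each intermediate code."""
    frames = []
    for t in torch.linspace(0.0, 1.0, n_frames):
        w = (1.0 - t) * w1 + t * w2    # convex combination in w+ space
        frames.append(G(w))            # decoded frame of the interpolation
    return frames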















In the top right images we can see how the plants disappear in the embedded images; this is a good example of how StyleGANs keep the important overall features of the data they are trained on but do not learn smaller details.
Scribble to Image
We can treat the scribble similarly to the input reconstruction. In this part, we have a scribble and a mask, so we can modify the latent vector to yield an image that looks like the scribble. To generate an image subject to constraints, we solve a penalized nonconvex optimization problem. We'll assume the constraints are of the form f_i(x) = v_i for some scalar-valued functions f_i and scalar values v_i. Written in a form that includes our trained generator G, this soft-constrained optimization problem is:

z^* = \arg\min_{z} \sum_i \lVert f_i(G(z)) - v_i \rVert
Given a user color scribble, we would like the GAN to fill in the details. Say we have a hand-drawn scribble image S with a corresponding mask M. Then for each pixel in the mask, we can add a constraint that the corresponding pixel in the generated image must be equal to the sketch. Since our color scribble constraints are all elementwise, we can reduce the above equation under these constraints to

z^* = \arg\min_{z} \lVert M \odot G(z) - M \odot S \rVert

where \odot is the Hadamard product, M is the mask, and S is the sketch. For the results below, we have used a perceptual loss weight of 0.05 and an L1 loss weight of 5.
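A rough sketch of the masked objective described above (function and argument names are mine; the perceptual term is optional and applied on the masked images):

import torch.nn.functional as F

def scribble_loss(generated, sketch, mask, perc=None, perc_weight=0.05, pix_weight=5.0):
    """Penalize the generated image only where the scribble mask is active (Hadamard product)."""
    loss = pix_weight * F.l1_loss(mask * generated, mask * sketch)
    if perc is not None:   # optional perceptual term on the masked regions
        loss = loss + perc_weight * F.mse_loss(perc(mask * generated), perc(mask * sketch))
    return loss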








Image Editing
Similar to the previous section, we will use the perceptual and pixel losses to edit an image. However, in this case we will first embed the initial image in the latent space and then apply the sketch to edit it. In the following images some of the results of this image editing are shown. These images have been obtained using the conv_4 layer to calculate the loss. We can observe that some of the colors in the sketches are not present in the GAN latent space and therefore not in the output images, so we get similar colors, but not the same ones.


















Bells & Whistles
High Resolution Grumpy Cats
We used a higher resolution GAN to generate more detailed grumpy cats and their interpolations! Here are the results:










User interface and demo
In the following GIFs, two possible user interfaces and interactions can be seen. The user is able to draw the editing sketches, and the model optimizes for the image output that best matches the initial image and the editing sketch.


What about Kim?
In the paper Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?, the authors embed different image classes in the StyleGAN latent space trained on the FFHQ dataset. They show that the model is capable of embedding the images even though it was not trained on those image classes. The paper also shows that, even though these images can be embedded in the latent space, the interpolation between them leads to images with features of the image class the model was trained on, in their case, faces. We decided to see what happens with our model when we try to embed images that are not cats. There could be no better images to embed than Kim Kardashian and the Spanish Ecce Homo. In the following images we can see that, unlike in the paper, our network is not capable of reconstructing the images. In the interpolations, as expected, we can also see the cat features that the network has learned.





In this project, we will explore neural style transfer, which renders specific content in a certain artistic style, for example, generating cat images in Ukiyo-e style. The algorithm takes in a content image, a style image, and another input image. The input image is optimized to match the previous two target images in content and style distance space.
In the first part of the assignment, we will start from random noise and optimize it in content space. It will help us get familiar with the general idea of optimizing pixels with respect to certain losses. In the second part of the assignment, we will ignore content for a while and only optimize to generate textures. This builds some intuitive connection between style-space distance and the Gram matrix. Lastly, we combine all of these pieces to perform neural style transfer.
This project is based on two articles by Gatys et al.: Texture Synthesis Using Convolutional Neural Networks and A Neural Algorithm of Artistic Style. The official PyTorch tutorial can be found here.
Content Reconstruction
For the first part of the assignment, we implement the content-space loss and optimize a random noise image with respect to the content loss only.
Content Loss: The content loss is a metric function that measures the content distance between two images at a certain individual layer. Denote the Lth-layer feature of the input image X as f^L_X and that of the target content image as f^L_C. The content loss is defined as the squared L2-distance of these two features:

\mathcal{L}_{content} = \lVert f^L_X - f^L_C \rVert_2^2
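A small sketch of such a content-loss module, in the spirit of the PyTorch tutorial linked above (inserting it after the chosen VGG layer is assumed to happen elsewhere):

import torch.nn as nn

class ContentLoss(nn.Module):
    """Squared L2 distance between the current features and the fixed target content features."""
    def __init__(self, target_feature):
        super().__init__()
        self.target = target_feature.detach()      # target content features, kept fixed

    def forward(self, input_feature):
        self.loss = nn.functional.mse_loss(input_feature, self.target)
        return input_feature                       # pass features through unchanged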
To extract the features, a VGG-19 net pre-trained on ImageNet is used. The pre-trained VGG-19 net consists of 5 blocks (conv1-conv5), with a total of 15 conv layers, and each block serves as a feature extractor at a different level of abstraction. In the following images the influence of the layer choice can be seen: from conv_1 to conv_3 there is barely any difference, however, at higher layers, for example conv_11, the content of the image can barely be recognized and the image is mainly noise.






The conv_5 layer, my favorite, will be used to reconstruct other images from input noise:






Texture Synthesis
Now we will implement the style-space loss.
Style loss: How do we measure the distance between the styles of two images? In the course, we discussed that the Gram matrix is used as a style measurement. The Gram matrix is the correlation of two vectors over every dimension. Specifically, denote the Lth-layer feature of an image, reshaped to the shape (N, K, H*W), as f^L. Then the Gram matrix is

G^L = f^L (f^L)^{\top}

in the shape of (N, K, K).
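A minimal sketch of the Gram matrix computation for a (N, K, H, W) feature map (the normalization constant is a common choice, not necessarily the one used in the course):

import torch

def gram_matrix(features):
    """Channel-wise correlations of a feature map, returned with shape (N, K, K)."""
    N, K, H, W = features.shape
    f = features.view(N, K, H * W)             # reshape to (N, K, H*W)
    gram = torch.bmm(f, f.transpose(1, 2))     # (N, K, K)
    return gram / (K * H * W)                  # normalize so the loss scale stays comparable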
The idea is that the Gram matrices of our optimized image and of the target style image should be as close as possible. In the following images we can see the influence of the different style layers selected to generate the texture. Lower layers maintain the color better, while deeper layers don't maintain it and show a noisier color. However, deeper layers maintain some elements; for example, in the bottom left, some eyes can be seen in the texture image.









The results using the layers from conv_1 to conv_7 are my favorites, so this arrangement will be used to generate textures from noise.






Style transfer
Finally, it is time to put the pieces together! We will use conv_5 as the content feature and conv_1-conv_7 as the style features. The style weight is set to 1000000 and the content weight is set to 1. We use the L-BFGS optimizer and optimize the input image for 300 steps.
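A rough sketch of how these pieces can be combined (assuming content_losses and style_losses are loss modules like the ones above, inserted into a VGG model):

import torch

def style_transfer(input_img, model, content_losses, style_losses,
                   style_weight=1e6, content_weight=1.0, steps=300):
    """Optimize the input image so its features match the content and style targets."""
    input_img.requires_grad_(True)
    optimizer = torch.optim.LBFGS([input_img])
    run = [0]
    while run[0] < steps:
        def closure():
            optimizer.zero_grad()
            model(input_img)   # forward pass populates the .loss attribute of each loss module
            loss = (style_weight * sum(sl.loss for sl in style_losses)
                    + content_weight * sum(cl.loss for cl in content_losses))
            loss.backward()
            run[0] += 1
            return loss
        optimizer.step(closure)
    with torch.no_grad():
        input_img.clamp_(0, 1)   # keep the result a valid image
    return input_img.detach()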
The following images show the style transfer of two different styles on two
different
images.








In the following images the influence of the noise versus content image initialization can be seen. Both take roughly the same time, although if we tried a different number of steps to change the results, the times might change. On the other hand, we can see that the content initialization maintains the texture better; actually, in the sky, we can see strokes similar to the ones in Van Gogh's masterpiece. Furthermore, the overall structure of the image is better, and the elements can be better distinguished.



(CPU times: user 25.3 s)

(Wall time: 25.9 s)
Now let's try this style transfer on... Kiki! She is always around while I
do
the
homework, so
she will enjoy seeing herself in different styles!










In this project, we will explore the coding and training of GANs (Generative Adversarial Networks). This assignment is divided into two parts: in the first part, we will implement a specific type of GAN designed to process images, called a Deep Convolutional GAN (DCGAN). We'll train the DCGAN to generate cats from samples of random noise. In the second part, we will implement a more complex GAN architecture called CycleGAN, which was designed for the task of image-to-image translation. We'll train the CycleGAN to convert between two kinds of cats (Grumpy and Russian Blue) and between apples and oranges.
Deep Convolutional GAN
For the first part of this assignment, we will implement a Deep Convolutional GAN (DCGAN) as introduced by Radford et al. A DCGAN is simply a GAN that uses a convolutional neural network as the discriminator, and a network composed of transposed convolutions as the generator. To implement the DCGAN, we need to specify three things: 1) the generator, 2) the discriminator, and 3) the training procedure. We will develop each of these three components in the following subsections.
Discriminator
The discriminator in this DCGAN is a convolutional neural network that has the following architecture:

In each of the convolutional layers shown above, we downsample the spatial dimension of the input volume by a factor of 2. Given the input-output relation out = (in - K + 2P)/S + 1, and that we use kernel size K = 4 and stride S = 2, setting in/2 = (in - 4 + 2P)/2 + 1 gives a padding of P = 1 for each convolution.

Below, a summary of the implemented discriminator can be seen. The sizes of the outputs of each layer are shown, as well as the number of parameters that were trained. After each convolution operation, a ReLU activation has been used except for the last layer, the 5th Conv2d layer.
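As a rough sketch of such a discriminator (the channel widths and the use of BatchNorm are assumptions; only the kernel size, stride, and padding follow the derivation above):

import torch.nn as nn

def conv_block(in_ch, out_ch, k=4, s=2, p=1, final=False):
    """Stride-2 convolution that halves the spatial size; the final layer has no norm/activation."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=p)]
    if not final:
        layers += [nn.BatchNorm2d(out_ch), nn.ReLU()]
    return layers

class DCDiscriminator(nn.Module):
    """Five convolutions that take a 64x64 image down to a single score."""
    def __init__(self, conv_dim=32):
        super().__init__()
        self.model = nn.Sequential(
            *conv_block(3, conv_dim),                      # 64 -> 32
            *conv_block(conv_dim, conv_dim * 2),           # 32 -> 16
            *conv_block(conv_dim * 2, conv_dim * 4),       # 16 -> 8
            *conv_block(conv_dim * 4, conv_dim * 8),       # 8 -> 4
            *conv_block(conv_dim * 8, 1, p=0, final=True), # 4 -> 1, no activation
        )

    def forward(self, x):
        return self.model(x).squeeze()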

Generator

The generator of the DCGAN consists of a sequence of transposed convolutional layers (in our implementation, upsampling followed by a conv2d) that progressively upsample the input noise sample to generate a fake image. Before each of the convolutional layers shown above, we upsample the spatial dimension of the input volume by a factor of 2. Given the input-output relation equation, and that we use kernel size K = 3 and stride S = 1, the padding that keeps the spatial size unchanged after each convolution is P = 1.

Below, a summary of the implemented generator can be seen. The sizes of the outputs of each layer are shown, as well as the number of parameters that were trained. After each convolution operation, a ReLU activation has been used except for the last layer, which uses a Tanh activation.
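A rough sketch of such a generator (the first transposed convolution that maps the noise vector to a 4x4 feature map, the channel widths, and the use of BatchNorm are assumptions):

import torch.nn as nn

def up_conv(in_ch, out_ch):
    """Upsample by 2, then a K=3, S=1, P=1 convolution that preserves the spatial size."""
    return [nn.Upsample(scale_factor=2),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU()]

class DCGenerator(nn.Module):
    """Maps a noise vector to a 64x64 image by repeated upsample + conv blocks."""
    def __init__(self, noise_size=100, conv_dim=32):
        super().__init__()
        self.model = nn.Sequential(
            nn.ConvTranspose2d(noise_size, conv_dim * 8, kernel_size=4),  # 1x1 -> 4x4
            nn.BatchNorm2d(conv_dim * 8),
            nn.ReLU(),
            *up_conv(conv_dim * 8, conv_dim * 4),     # 4 -> 8
            *up_conv(conv_dim * 4, conv_dim * 2),     # 8 -> 16
            *up_conv(conv_dim * 2, conv_dim),         # 16 -> 32
            nn.Upsample(scale_factor=2),              # 32 -> 64
            nn.Conv2d(conv_dim, 3, kernel_size=3, stride=1, padding=1),
            nn.Tanh(),                                # last layer uses Tanh instead of ReLU
        )

    def forward(self, z):
        return self.model(z.view(z.size(0), -1, 1, 1))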

Training loop
Next, we implemented the training loop for the DCGAN. A DCGAN is simply a GAN with a specific type of generator and discriminator; thus, we train it in exactly the same way as a standard GAN. The pseudo-code for the training procedure is shown below.
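Complementing the pseudo-code, here is a minimal sketch of one training step (a least-squares GAN objective is assumed here; the exact loss used in the assignment may differ):

import torch

def train_step(G, D, real, noise_size, g_opt, d_opt):
    """One discriminator update followed by one generator update."""
    batch = real.size(0)

    # Discriminator: push real images towards 1 and fake images towards 0.
    d_opt.zero_grad()
    fake = G(torch.randn(batch, noise_size, device=real.device)).detach()
    d_loss = 0.5 * ((D(real) - 1).pow(2).mean() + D(fake).pow(2).mean())
    d_loss.backward()
    d_opt.step()

    # Generator: try to make the discriminator output 1 on fake images.
    g_opt.zero_grad()
    g_loss = (D(G(torch.randn(batch, noise_size, device=real.device))) - 1).pow(2).mean()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()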

Results
To train the DCGAN we have used basic (normalization) and deluxe data augmentation techniques (random crops and horizontal flips). In the following plots, we can see the influence of the data augmentation techniques. One of the main problems when training GANs is overfitting; this occurs when the data used for training is so small that the model memorizes it, which deteriorates the performance. Therefore, using data augmentation techniques increases the number of training examples, improving the training.

200 iterations

200 iterations
To further improve the data efficiency of GANs, we have also applied the differentiable augmentations discussed in this paper. In the plots below we can see the influence of this technique. The discriminator loss is higher this time, as it is more difficult to differentiate real images from fake ones, which also makes the generator loss decrease.

200 iterations

200 iterations
The outputs of training the models for 200 iterations and 6500 iterations are
the
following:
- For basic data augmentation:


- For deluxe data augmentation:


In the previous images we can see the influence of this data augmentation technique on the quality of the output images, even when the number of training iterations is the same.
- For deluxe data augmentation with differentiable augmentation:


Here we can see how the model crashed when differentiable augmentation was used, and the model does not improve:

6500 iterations
CycleGAN
In the second part of this assignment, we are going to implement a CycleGAN as introduced by Zhu et al. CycleGANs are particularly interesting because they allow the use of unpaired training data. This means that, in order to train a model to translate images from domain X to domain Y, we do not have to have exact correspondences between individual images in those domains, as is the case for paired image-to-image translation.
Generator

The generator in the CycleGAN has layers that implement three stages of computation: 1) the first stage encodes the input via a series of convolutional layers that extract the image features; 2) the second stage then transforms the features by passing them through one or more residual blocks; 3) the third stage decodes the transformed features using a series of transposed convolutional layers, to build an output image of the same size as the input.
The residual block used in the transformation stage consists of a convolutional layer where the input is added to the output of the convolution. This is done so that the characteristics of the output image (e.g., the shapes of objects) do not differ too much from those of the input.
Below, a summary of the implemented generator can be seen. The sizes of the outputs of each layer are shown, as well as the number of parameters that were trained. After each convolution operation, a ReLU activation has been used except for the last layer, which uses a Tanh activation.
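A minimal sketch of such a residual block (kernel size and normalization are assumptions; the full generator stacks encoder convolutions, one or more of these blocks, and upsampling decoder layers):

import torch.nn as nn

class ResnetBlock(nn.Module):
    """Adds the input back to the output of a convolution so the image structure is preserved."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        return x + self.conv(x)    # skip connection: output = input + conv(input)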

PatchDiscriminator
CycleGAN adopts a patch-based discriminator. Instead of directly classifying an image as real or fake, it classifies patches of the image, allowing CycleGAN to model local structures better. To achieve this effect, we reduce the spatial output to a dimension of 4x4 instead of a scalar (1x1) as before.
Below, a summary of the implemented discriminator can be seen. The sizes of the outputs of each layer are shown, as well as the number of parameters that were trained. After each convolution operation, a ReLU activation has been used except for the last layer.

Training loop
To train the CycleGAN we implement the following training procedure:

Cycle Consistency
The most interesting idea behind CycleGANs (and the one from which they get their name) is the idea of introducing a cycle consistency loss to constrain the model. The idea is that when we translate an image from domain X to domain Y, and then translate the generated image back to domain X, the result should look like the original image that we started with. The cycle consistency component of the loss is the mean squared error between the input images and their reconstructions obtained by passing through both generators in sequence (i.e., from domain X to Y via the X->Y generator, and then from domain Y back to X via the Y->X generator). The cycle consistency loss for the Y->X->Y cycle is expressed as follows:

\mathcal{L}_{cycle}^{Y \to X \to Y} = \frac{1}{m}\sum_{i=1}^{m} \lVert y^{(i)} - G_{X \to Y}(G_{Y \to X}(y^{(i)})) \rVert^2
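A minimal sketch of this loss term for the Y->X->Y cycle (the generator names and the weighting factor are assumptions):

import torch

def cycle_consistency_loss(real_Y, G_XtoY, G_YtoX, lambda_cycle=10.0):
    """Translate Y -> X -> Y and penalize the mean squared reconstruction error."""
    fake_X = G_YtoX(real_Y)             # Y translated into domain X
    reconstructed_Y = G_XtoY(fake_X)    # translated back into domain Y
    return lambda_cycle * torch.mean((real_Y - reconstructed_Y) ** 2)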

Results
In the following images we can see the influence of the cycle consistency loss on the output results. In the first two images, the results of training from domain X to Y and vice versa, with and without cycle consistency, are shown:
X -> Y: from Russian Blue to Grumpy

Y -> X: from Grumpy to Russian Blue


In the previous images we can see that introducing the cycle consistency loss improves the results and reduces the visual artifacts. Therefore, the second training, with cycle consistency loss, has been continued until 10000 iterations. The results can be seen below:


In the following experiments, we can compare the previous results, using the PatchDiscriminator, with the results using the previous DCDiscriminator. We can see that the PatchDiscriminator, another of the differences between DCGAN and CycleGAN, improves the results of the domain transformation.


The same experiments were carried out with the apple/orange dataset, with the same observations: the cycle consistency loss as well as the PatchDiscriminator improved the results.
X -> Y: from apples to oranges


Y -> X: from oranges to apples


Results after 10000 iterations using PatchDiscriminator:


Results after 10000 iterations using the PatchDiscriminator and the DCDiscriminator:


I find it very interesting that in the orange dataset there are a lot of images of the open fruit, while in the apple dataset there are very few. Furthermore, oranges are orange on the inside, the same color inside as outside, but apples are white on the inside. The model has trouble learning this, and in the results we can see apples that are red on the inside. Another thing that caught my attention was that the big majority of the apples are red, so when the model sees a green apple, it doesn't convert it into an orange.
Bells & Whistles
Spectral normalization
Spectral normalization is applied to a parameter in the given module. It stabilizes the training of discriminators in Generative Adversarial Networks (GANs) by rescaling the weight tensor with the spectral norm σ of the weight matrix, calculated using the power iteration method.
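In PyTorch this can be done by wrapping the discriminator's convolutions with torch.nn.utils.spectral_norm, roughly as sketched below:

import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_ch, out_ch, k=4, s=2, p=1):
    """Convolution whose weight is rescaled by its spectral norm at every forward pass."""
    return spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=p))

layer = sn_conv(3, 32)   # drop-in replacement for a plain nn.Conv2d in the discriminator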


When Kim Kardashian meets GANs

If I was going to create a dataset to train a GAN, it couldn't be of anyone other than the queen of selfies. To train this GAN I collected a dataset of 148 images from her Instagram:

When Kim Kardashian meets CycleGANs
I didn't manage to train a good GAN for Kim K because of the small dataset I generated (or maybe because she is so unique and irreplicable 😜). I also realized that, as we commented in class, the images should have been better preprocessed, for example, by aligning the face in all the training images.
Nevertheless, I got a meme:
If you don't understand the meme, you can take a look at this famous Spanish Ecce Homo fresco restoration.



In this project, we will explore gradient-domain processing, a simple technique with a broad set of applications including blending, tone-mapping, and non-photorealistic rendering. For this assignment, we will focus on 'Poisson blending', 'mixed gradients' and 'color2gray'.
The primary goal of this assignment is to seamlessly blend an object or texture from a source image into a target image. The method presented above is called "Poisson blending" and uses the gradients of both of the images that we want to combine to make the transition as smooth as possible. It was introduced by Perez et al. in this 2003 paper.
In the previous image, on the left, we have the target image, into which we want to add another image, what we will call the source image, in this case, a bear. Next to it, we have the naive blend, a simple copy and paste using a mask. This, however, doesn't give a good result, therefore, on the right, Poisson blending has been applied.
Process
Gradients
To understand Poisson blending, first we need to understand what a gradient is. As in calculus, a gradient is the derivative of a function, in this case, the derivative of each pixel. But how do we calculate this? To do so, we have to start by thinking about what a derivative is: the rate of change of a function in a given direction. In the case of pixels, this is calculated by comparing the values of the pixels in a given direction. Let's imagine the following 3x3 image with pixel values [[0, 1, 2], [3, 4, 5], [6, 7, 8]]. The derivatives of the p4 pixel will be defined by its 4 neighbours: p3 to the left, p5 to the right, p1 going up, and p7 going down. So the derivatives can be written as follows:

← p3-p4=3-4=-1
→ p5-p4=5-4=+1
↑ p1-p4=1-4=-3
↓ p7-p4=7-4=+3
Toy problem
Before implementing the Poisson blending algorithm, we are asked to solve a toy problem. In this example we’ll compute the x and y gradients from an image s, then use all the gradients, plus one pixel intensity, to reconstruct an image v. If the implementation is correct, the output should recover the input image. Let's denote the intensity of the source image at (x, y) as s(x,y) and the values of the image to solve for as v(x,y). For each pixel, then, we have two objectives:
- Minimize ((v(x+1,y) − v(x,y)) − (s(x+1,y) − s(x,y)))^2, so the x-gradients of v closely match the x-gradients of s.
- Minimize ((v(x,y+1) − v(x,y)) − (s(x,y+1) − s(x,y)))^2, so the y-gradients of v closely match the y-gradients of s.
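A minimal sketch of how these objectives can be assembled into a sparse least-squares system and solved (helper names are mine; a float-valued grayscale image s is assumed):

import numpy as np
import scipy.sparse
from scipy.sparse.linalg import lsqr

def toy_reconstruct(s):
    """Recover an image from its x/y gradients plus a single pixel intensity."""
    H, W = s.shape
    idx = np.arange(H * W).reshape(H, W)        # maps (y, x) to a variable index
    rows, cols, data, b = [], [], [], []
    eq = 0

    def add(coeffs, value):                     # append one least-squares equation
        nonlocal eq
        for var, coef in coeffs:
            rows.append(eq); cols.append(var); data.append(coef)
        b.append(value); eq += 1

    for y in range(H):
        for x in range(W):
            if x + 1 < W:                       # x-gradient objective
                add([(idx[y, x + 1], 1.0), (idx[y, x], -1.0)], s[y, x + 1] - s[y, x])
            if y + 1 < H:                       # y-gradient objective
                add([(idx[y + 1, x], 1.0), (idx[y, x], -1.0)], s[y + 1, x] - s[y, x])
    add([(idx[0, 0], 1.0)], s[0, 0])            # pin one intensity so the solution is unique

    A = scipy.sparse.csr_matrix((data, (rows, cols)), shape=(eq, H * W))
    v = lsqr(A, np.array(b))[0]
    return v.reshape(H, W)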
The result after minimizing the objectives:

Poisson blending
In order to make a seamless transition between any two images, we need to think about the gradients of both of the images rather than about the overall intensity. The problem consists of finding the right values for the target pixels that maximally preserve the gradient of the source region, without changing any of the background pixels. Note that we are making a deliberate decision to ignore the overall intensity, so some color change could occur; as seen before, a brown bear could turn black, but it would still look like a bear.





We can formulate our objective as a least squares problem. Given the pixel intensities of the source image "s" and of the target image "t", we want to solve for new intensity values "v" within the source region "S":

v = \arg\min_{v} \sum_{i \in S,\, j \in N_i \cap S} \big((v_i - v_j) - (s_i - s_j)\big)^2 + \sum_{i \in S,\, j \in N_i \setminus S} \big((v_i - t_j) - (s_i - s_j)\big)^2

In the previous formula, we are summing over the pixels of the region "S". This region represents the points of the source image that we want to copy into the target image. For this task, we were given code to create the mask and align it with both the source and target images.
In the formula, each "i" is a pixel in the source region "S", and each "j" is a 4-neighbour of "i". Each summation guides the gradient values to match those of the source region. In the first summation, the gradient is over two variable pixels; in the second, one pixel is variable and one is in the fixed target region. The first part sets the gradients of "v" inside "S", while the second part sets the gradients around the boundary of "S".
To solve for v, we have used the scipy.sparse.linalg.lsqr function. This function minimizes our least squares problem, which has the form ||Av - b||^2. It returns the values of v that minimize the gradient objectives and that are used to generate the output image.
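A rough single-channel sketch of the blending step (assuming a boolean mask that does not touch the image border and float-valued images):

import numpy as np
import scipy.sparse
from scipy.sparse.linalg import lsqr

def poisson_blend(source, target, mask):
    """Solve for pixel values inside the mask that preserve the source gradients
    while matching the target outside the mask."""
    H, W = target.shape
    n_vars = int(mask.sum())
    idx = -np.ones((H, W), dtype=int)
    idx[mask] = np.arange(n_vars)                    # variables only inside the region S
    rows, cols, data, b = [], [], [], []
    eq = 0
    for y in range(H):
        for x in range(W):
            if not mask[y, x]:
                continue
            for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):    # 4-neighbours
                ny, nx = y + dy, x + dx
                grad = source[y, x] - source[ny, nx]             # source gradient to preserve
                rows.append(eq); cols.append(idx[y, x]); data.append(1.0)
                if mask[ny, nx]:                                 # both pixels are variables
                    rows.append(eq); cols.append(idx[ny, nx]); data.append(-1.0)
                    b.append(grad)
                else:                                            # neighbour fixed to the target value
                    b.append(grad + target[ny, nx])
                eq += 1
    A = scipy.sparse.csr_matrix((data, (rows, cols)), shape=(eq, n_vars))
    v = lsqr(A, np.array(b))[0]
    result = target.copy()
    result[mask] = v
    return result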
Results
Kiki travels
For this example I have tried to take Kiki, my cat, around so she can explore new places. I have used the Poisson blending algorithm to take her with me to CMU on a snowy day. I have also taken her to the beach, a place I could also take myself to...
Kiki at CMU:





Kiki at the beach:





I am pretty surprised by the results! However, some artifacts can be seen in the image of Kiki at CMU: between her ears, the fence of the tennis court disappears. In the beach case, the output image makes Kiki too dark due to the gradients. Furthermore, the blending is not as good as in the previous image. This is because the background colors matched better in the previous example.
Naruto's Deidara
When I was young I used to read the Naruto comics. Back then I was a fan of one of the characters, Deidara, who had mouths in his hands:
Using this algorithm I have been able to be like him, at least virtually.





Trying jewelry
I really enjoy designing rings and other kinds of wearables, so I thought that this could be a great opportunity to explore the possibilities of the Poisson blending algorithm for trying on jewelry.





I think this is a great tool to create rapid mockup images without spending a lot of time on making a perfect mask around the ring. In this example we can also see what we mentioned before, that this algorithm changes the color of the objects. Actually, the color of the silver ring disappears as it fuses with the color of the skin.
After seeing these results, I wondered how this algorithm would work with a more complex shape... so I tried it!





This time the results are not as surprising as before, mainly because I lost all my hair in the process due to the blending. To solve this problem, I applied the mixed gradients algorithm in the bells and whistles of this assignment.
Bells and Whistles
Mixed gradients
To solve the problem seen above, in this section we will implement the mixed gradients algorithm for images where transparency is important. In this algorithm we follow the same steps as in Poisson blending, but instead of using the gradients of the source image, we use the gradient with the larger magnitude in either the source or the target image as the guide:

d_{ij} = \begin{cases} s_i - s_j & \text{if } |s_i - s_j| \geq |t_i - t_j| \\ t_i - t_j & \text{otherwise} \end{cases}
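In the Poisson blending sketch above, this amounts to replacing the source gradient with the larger of the two gradients, for example:

import numpy as np

def guide_gradient(ds, dt):
    """Pick, elementwise, the gradient with the larger magnitude (source ds or target dt)."""
    return np.where(np.abs(ds) >= np.abs(dt), ds, dt)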
The results of applying this algorithm to the previous image



Mixed gradients work very well in this example. Nevertheless, this algorithm can make the source image too transparent. This is something we need to take into account when deciding which algorithm we are going to use. For example, in the following images we can see the results of applying mixed gradients to the Kiki-at-the-beach image.



Although both of the images are seamlessly blended, when applying this algorithm the source image may seem to become more transparent; if we zoom in, we can see the beach through Kiki's fur. This is because inside the region S we now also consider the gradients of the target image.
Color2gray
For this part of the assignment, we will see another application of these blending algorithms: the color2gray transformation. When converting a color image to grayscale (e.g., when printing to a laser printer), we lose important contrast information, making the image difficult to understand. To see this, we will use the images used for testing color blindness. As can be seen below, when these images are converted to grayscale, they no longer show the numbers. To solve this, we are going to use these blending techniques to create a gray image that has similar intensity to the rgb2gray output but maintains the contrast of the original RGB image.
To do this, we first convert the RGB image into the HSV (Hue, Saturation, Value) space. In the image below we can see the example image as an RGB image on the left, and next to it the corresponding images of each of the HSV channels. In the HSV space, we can examine the color of the image, the intensity of that color, and the brightness. The image representing the brightness is similar to the rgb2gray version. Therefore, to create our color2gray version, we will use the S and V channels of the image and approach it as a mixed gradients problem. We will use the white pixels of the original image to generate a mask. The results of this color2gray:
Sergey Mikhaylovich Prokudin-Gorsky was a chemist and photographer of the Russian Empire. He is best known for his pioneering work in color photography and his effort to document early 20th-century Russia. In imitation of the way the human eye senses color, in his photographs the visible spectrum was divided into three channels of information by capturing it in the form of three black-and-white photographs, one taken through a red filter, one through a green filter, and one through a blue filter.

Those original negative images (available in the Library of Congress) are going to be used in this assignment to compose color photographs. Due to the way the images were captured, as three separate exposures, as can be seen in the image above, the negatives are not aligned. Therefore, to generate the RGB images, these negatives first need to be preprocessed.
Process

To generate the final image, we need to divide the initial image into its red, green, and blue channels. Actually, the original glass negatives are ordered in blue, green, and red order from top to bottom. Then, as we can see in the GIF image, we have to find the correct alignment of the three channels, taking one of them as the base, in our case the blue channel. Once the alignment is found, we can combine the three channels.
Search Methods
As matching criteria, two different metrics have been considered, the Sum of Squared Differences (SSD) and the Normalized Cross-Correlation (NCC).
The Sum of Squared Differences is calculated based on the following equation, where F and G are the two arrays we are comparing and (h, w) indexes the corresponding pixels at a given height and width:

SSD(F, G) = \sum_{h,w} \big(F(h,w) - G(h,w)\big)^2

The best alignment is the one with the lowest SSD value, i.e., the argmin of the previous equation.
The Normalized Cross-Correlation is calculated based on the following equation where, as the name suggests, both of the arrays are first normalized, with μF and μG being the averages of F and G respectively:

NCC(F, G) = \frac{\sum_{h,w} (F(h,w) - \mu_F)(G(h,w) - \mu_G)}{\lVert F - \mu_F \rVert \, \lVert G - \mu_G \rVert}

The best alignment is the one with the highest NCC value.
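Both metrics are straightforward to express with NumPy, for example:

import numpy as np

def ssd(F, G):
    """Sum of Squared Differences: the best alignment minimizes this value."""
    return np.sum((F - G) ** 2)

def ncc(F, G):
    """Normalized Cross-Correlation: the best alignment maximizes this value."""
    f = F - F.mean()
    g = G - G.mean()
    return np.sum(f * g) / (np.linalg.norm(f) * np.linalg.norm(g))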
Image Pyramid
For higher resolution images, the previous brute force approach is not feasible, as the number of possible alignment combinations increases, which translates into a higher computation time. For those images, the pyramid algorithm is used. This algorithm consists of creating a multi-scale representation of the image. At each level the image is reduced by half; to do so, prior to the downsampling, a Gaussian filter is applied to prevent aliasing in the process. In my case, I have downsampled the images until they reach a dimension similar to the cathedral image, where we have seen that the initial search on a 15x15 grid is feasible. The search for the correct alignment is started at this smaller dimension of the image and is then propagated to higher dimensions. Every time we look for the correct alignment on a higher dimension image, the previous displacement is taken into account and a new 3x3 displacement grid is considered.
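A rough sketch of this coarse-to-fine search (using SSD as the metric; the window size and the minimum pyramid level are assumptions consistent with the description above):

import numpy as np
from scipy.ndimage import gaussian_filter, shift

def align_pyramid(channel, base, window=7, min_size=400):
    """Recursively align on a blurred, half-sized pair, then refine on a 3x3 grid."""
    if max(channel.shape) <= min_size:
        best = (0, 0)                                    # exhaustive 15x15 search at the coarsest level
        search = range(-window, window + 1)
    else:
        small_c = gaussian_filter(channel, 1)[::2, ::2]  # blur before downsampling by 2
        small_b = gaussian_filter(base, 1)[::2, ::2]
        dy, dx = align_pyramid(small_c, small_b, window, min_size)
        best = (2 * dy, 2 * dx)                          # scale the coarse displacement back up
        search = range(-1, 2)                            # local 3x3 refinement
    best_score, best_shift = np.inf, best
    for dy in search:
        for dx in search:
            cand = (best[0] + dy, best[1] + dx)
            score = np.sum((shift(channel, cand) - base) ** 2)   # SSD between shifted channel and base
            if score < best_score:
                best_score, best_shift = score, cand
    return best_shift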
Bells and Whistles
Automatic cropping
Due to the alignment process, we can see that the borders of the resulting photographs have strange colors, caused by the displacement of the three channels and by the black and white borders of the original glass negatives. To find those borders, a Sobel filter has been applied both vertically and horizontally. The absolute values of the outputs have been taken and then combined vertically and horizontally to find the respective borders.
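A rough single-channel sketch of this border search (the margin within which edges are searched is an assumption):

import numpy as np
from scipy.ndimage import sobel

def find_crop(channel, margin=0.1):
    """Return (top, bottom, left, right) crop boundaries from the strongest edges near the frame."""
    H, W = channel.shape
    edge_rows = np.abs(sobel(channel, axis=0)).mean(axis=1)   # horizontal-edge strength per row
    edge_cols = np.abs(sobel(channel, axis=1)).mean(axis=0)   # vertical-edge strength per column
    mh, mw = int(H * margin), int(W * margin)
    top = int(np.argmax(edge_rows[:mh]))                      # strongest edge in the top margin
    bottom = H - mh + int(np.argmax(edge_rows[-mh:]))
    left = int(np.argmax(edge_cols[:mw]))
    right = W - mw + int(np.argmax(edge_cols[-mw:]))
    return top, bottom, left, right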

Crops: (126, 118, 222, 317)
Gradient for the alignment
Some of the images, such as the emir or the village images, may be more difficult to align. This is because our alignment metric is based on the pixel values of the three images; however, this may not be a good approach, as the pixel values may vary a lot between the different channels, as is the case for the emir's clothing, which is blue, so it has high pixel values in that channel but not in the others. In that case, we can use the same search metric but, instead of using the pixel values, use other image features, such as the gradient, to find the best alignment.

G:(65, 10) R:(136, -4)

G:(64, 11) R:(137, 22)
Automatic contrast
A sigmoid function has been used to apply automatic contrast to the image and improve its perception.


Sigmoid: alpha=6, beta=2.5
Results
The final alignment results of our methods on all the input images can be seen below.

















