Sketch2Anime: Don’t feel bad about your drawings.

Aslan Shi
10 min read · Dec 2, 2020

By: Aslan, Hehan, Nico, Lei, Tingyi

The above image comes from Danbooru2019: https://www.gwern.net/Danbooru2019

Do you love anime? If so, are you familiar with the term “Waifu”? Well, if you answered yes to both questions, this blog may well catch your interest. But please don’t turn away if you didn’t. Our work is still fun to read :)

We are a team of people who answer yes to both questions. We love watching anime, love using anime characters as our avatars, and would love to have our own anime characters. But the problem is…we’re quite bad at drawing. By bad, I mean really bad…We soon realized that what we have in mind when we start drawing can be very different from what we end up with when we finish.

For instance, we might have this in mind at first:

The above image is the edge-extracted sketch from an illustration from Danbooru2019.

But here is what we got:

Cute, but…far from anime, right? So we looked for available models to help us. Unfortunately, no single model has all the functionality we’re looking for. There are two main problems. First, existing models are not flexible enough: some of them only provide sliders or buttons for users to pick from, with no further engagement involved. Though we admit we’re not professional at drawing, we still want to participate more in the process. Here is an example of such work:

Or, they may ask users to pick an already well-formed image as a base:

We found that what we want is a sketch-to-image translation that can turn a user’s sketch into a realistic anime character. That way, we can engage more in the creation. But remember, the most important thing is: we’re bad at drawing! One should expect that we cannot provide good sketches. Therefore, our model should be able to take care of this and convert sketches into something more professional, while keeping the original form as much as possible. Without this constraint, there would be no point in doing this. Another problem that came up along the way is that the majority of sketch-to-image works (across all drawing domains) tend to overfit on the user’s input, meaning the models expect the inputs to provide enough detail. Inspired by this work, we propose a solution to this. More details are discussed below.

The second problem with existing models is that users don’t have free control over the colors used in the generated images. Some, such as MakeGirlsMoe, do let users choose colors, but only for limited areas — eyes and hair. It would be more interesting if users had more control by providing color hints while drawing their sketches.

To fill the gap between what we want and what we found, we make the following contributions in this work!

  • a flexible sketch-to-image translation network that allows for unprofessional inputs
  • automatic colorization using color hints provided by users
  • an interactive panel where users can draw and see the anime characters they create

Now let’s move on to the details!

Data collection & preprocessing

We collected our data from Danbooru2019, which is a widely used dataset for anime illustration-related tasks. Specifically, we downloaded 30k face-cropped images (512x512). These images don’t come with corresponding sketches, though, so we did edge extraction. We tried different tools, including Photoshop (the Photocopy filter), DexiNed, and our own detail filter. On top of the resulting sketches, we applied sketch simplification, using this work, to help make the lines clear and neat. Here are some examples:

The original image comes from Google search.

As one can see, Photoshop gives pretty neat and detailed edges, while DexiNed’s results have some noise here and there. In our first attempt, we used Photoshop. However, we found that the extracted sketches were too “perfect”: when the model learns from them, the “overfitting on professionalism” issue we discussed above shows up. Therefore, we turned to designing our own detail filter. As the name suggests, we aim to filter out some details in the original images. Specifically, we first apply a Gaussian blur, then Canny edge detection from OpenCV, and finally discard independent lines shorter than a certain number of pixels. This gives us relatively incomplete and simple sketches. We believe this forces our model to learn without seeing complete information, which is exactly the case when it sees a user’s input at inference time.
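For concreteness, here is a minimal sketch of such a detail filter using OpenCV. The blur kernel size, Canny thresholds, and minimum stroke length are illustrative placeholders, not our exact settings:

import cv2
import numpy as np

def detail_filter(image_path, min_length=30):
    # Load as grayscale and blur to suppress fine textures
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    blurred = cv2.GaussianBlur(img, (5, 5), 0)

    # Canny edge detection (thresholds are illustrative)
    edges = cv2.Canny(blurred, 50, 150)

    # Discard independent strokes shorter than `min_length` pixels
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    sketch = np.zeros_like(edges)
    for c in contours:
        if cv2.arcLength(c, False) >= min_length:
            cv2.drawContours(sketch, [c], -1, 255, 1)

    # Return a white-background, black-line sketch
    return 255 - sketch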

After data preprocessing, we have 30k pairs of illustrations and sketches.

Model

The most intuitive approach for image translation tasks is a GAN that takes an input image and generates an output image (we’ll refer to this as Simple GAN later). This is definitely a valid approach, but not really suitable for our case. For such a network to perform well during testing, the test inputs should come from a distribution similar to the training inputs. It should be clear by now that this is exactly the expectation violated in the problem we’re trying to solve. Therefore, we propose an extra projection phase in between, achieved by KNN + constrained least-squares minimization. We’ll touch on this in detail, but first let’s take a look at our model:

During training, we have two stages. In stage 1, we pass all training sketches into an Autoencoder and train it to reproduce the sketches. In stage 2, we freeze the encoder of the Autoencoder, feed in the sketch, and use the latent vector of size 512 as one of the inputs to the generator. The other input to the generator is a color hint, as one can see in the diagram above. It is obtained with the PIL drawing library and a round brush, inspired by this work. The brush randomly picks colors and locations from the input image and draws strokes on an empty canvas. Doing this multiple times gives us a color hint. This simulates what users will provide at inference time, where we expect them to draw some colors on certain areas; for example, some red on the eyes, or some yellow in the hair. By selecting randomly, we aim to increase robustness. We’ll refer to this as color hint simulation in the rest of this blog. After this, the color hints (512x512) are compressed to 128x128, combined with a binary mask (128x128), and fed into an intermediate layer of the generator. Our reasoning is that the model might not need to process the color information from the very beginning. With the images produced by the generator, we feed (image, color hint) pairs into the discriminator. Once training is done, we save the latent vectors of all training sketches for later use.
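To make the color hint simulation concrete, here is a rough PIL sketch. The number of strokes, brush width, and stroke length are guesses for illustration, not our exact training settings:

import random
from PIL import Image, ImageDraw

def simulate_color_hint(illustration, n_strokes=15, brush_radius=6, stroke_len=20):
    # Draw short strokes on an empty canvas, sampling colors and
    # locations from the ground-truth illustration (assumed RGB, 512x512)
    w, h = illustration.size
    hint = Image.new("RGB", (w, h), (255, 255, 255))   # empty white canvas
    mask = Image.new("L", (w, h), 0)                   # binary mask of hinted pixels
    draw_hint = ImageDraw.Draw(hint)
    draw_mask = ImageDraw.Draw(mask)

    pixels = illustration.load()
    for _ in range(n_strokes):
        x, y = random.randint(0, w - 1), random.randint(0, h - 1)
        color = pixels[x, y]                           # color at the sampled location
        dx = random.randint(-stroke_len, stroke_len)
        dy = random.randint(-stroke_len, stroke_len)
        draw_hint.line([(x, y), (x + dx, y + dy)], fill=color, width=2 * brush_radius)
        draw_mask.line([(x, y), (x + dx, y + dy)], fill=255, width=2 * brush_radius)

    # Downsample to the 128x128 resolution fed into the intermediate layer
    return hint.resize((128, 128)), mask.resize((128, 128))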

At inference time, things are a little bit different. First, we only activate the encoder of the Autoencoder. Second, using the test sketch’s latent vector and the saved latent vectors of all training data, we run a KNN search. The purpose of KNN here is to find the close neighbors of the test sketch in the training set, which we consider to be better-looking, more professional sketches. Then a projection is made, with the following formulation:
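(Written out in our own notation, and reading “weighted average” as weights that are non-negative and sum to one, the constrained least-squares problem is roughly:)

\min_{w \in \mathbb{R}^{K}} \; \Big\| z_{\text{test}} - \sum_{i=1}^{K} w_i \, z_i \Big\|_2^2
\quad \text{subject to} \quad \sum_{i=1}^{K} w_i = 1, \quad w_i \ge 0,

where z_test is the encoded test sketch, z_1, …, z_K are the latent vectors of its K nearest training sketches, and the projected latent is the weighted average \hat{z} = \sum_i w_i z_i.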

Here we’re trying to minimize the distance between the features of the test sketch and a weighted average of the features of its neighbors. By doing this, we keep the original form as much as possible. Again, without this constraint there would be no point in doing this, since we could just generate unrelated images without using the user’s input at all. To summarize, the projection step converts the user’s unprofessional input into something more professional while preserving the user’s intent as much as possible. It’s true that we pre-simplified our training sketches, but remember: the training sketches are merely simplified and incomplete, while their forms are still professional. The users’ inputs, on the other hand, are unprofessional on top of that.

After solving the least-squares minimization, we get a set of weights for the neighbors and, finally, a projected latent vector for the test sketch. Following the same procedure as in training, it is fed into the generator together with the color hints provided by the user. This time the color hints are not simulated but come directly from the canvas the user draws on. In our experiments, we found that a neighbor size of 10 produces reasonable results.
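As a minimal sketch of this inference-time projection (assuming the training latents are stored in a NumPy array, and using scikit-learn for the neighbor search with a non-negative least-squares solve followed by normalization as a stand-in for the exact constrained solver):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.optimize import nnls

def project_latent(z_test, train_latents, k=10):
    # Find the k nearest training sketches in latent space
    knn = NearestNeighbors(n_neighbors=k).fit(train_latents)
    _, idx = knn.kneighbors(z_test.reshape(1, -1))
    neighbors = train_latents[idx[0]]            # shape (k, 512)

    # Solve min || neighbors^T w - z_test || with w >= 0,
    # then normalize so the result is a weighted average
    w, _ = nnls(neighbors.T, z_test)
    w = w / (w.sum() + 1e-8)

    return w @ neighbors                         # projected latent, shape (512,)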

To better illustrate the effect of the KNN + projection we discussed above, here are two examples:

user inputs (left); projected sketches (right)

On the left are two user inputs. After KNN + projection, we pass the projected latent vectors into the decoder of the Autoencoder we trained and get the reproduced sketches on the right. One can notice that the model is able to turn incomplete sketches into something more detailed and anime-like. One disclaimer: the decoder is used here only to demonstrate the effect of our proposed method. It was not designed to produce the most realistic images (its training data are incomplete, and it only tries to replicate them). The later processing in the generator provides further refinement, where another source of information (color) is used as well.

Experiments & Results

In this section we compare results from Simple GAN and our proposed model. As discussed above, Simple GAN is the most intuitive approach to image translation. However, in our problem, we don’t want the generator to learn directly from user inputs, as they’re not very anime-like. Here are examples using Simple GAN:

user input (left); generated image using Simple GAN (right)

It’s a reasonable reproduction, except for the color (we’ll talk about the color problem later). But it’s not what we want.

On the other hand, here are some results we got from our proposed network:

sketch (left); color hint (middle); generated images from proposed network (right)

Using KNN + projection, our model is able to map the user’s sketch into the latent space and “modify” it into a more professional-looking shape while maintaining the original form. Of course, these examples were selected from our fairly good-looking test results. There are many failure cases. Most of them were caused by unsuccessful color filling, which we found a little frustrating, as it was one of the main goals of this project.

Diagnosis

As one can see from the results above, even in these reasonably good ones, colors were not always filled according to the instructions. For example, in the third row, the yellow hint the user placed on the eyes is not reflected well in the output. Also, the background colors of all outputs, though not indicated by users (or, in other words, indicated as white), seem to be out of control. We also found that this color-filling issue varies across epochs. We think there are three possible reasons.

Firstly, it could be that our network architecture is not able to digest the color information well enough. Currently, the generator and discriminator receive color hints at intermediate layers. Specifically, in the generator’s case, the hint is fed into the model after the first 3 Conv + Residual blocks and then processed together with the input vector during the last 2 Conv + Residual blocks. Due to time constraints, we were not able to test different variations of this, but we think that introducing the hint at different stages may have a significant effect on learning efficiency.
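For reference, here is a simplified PyTorch sketch of this injection point. The channel counts, block definitions, and output resolution are placeholders; only the position where the hint and mask are concatenated mirrors the description above:

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    def __init__(self, latent_dim=512, ch=64):
        super().__init__()
        self.ch = ch
        # Map the 512-d sketch latent to a 16x16 feature map, then upsample to 128x128
        self.from_latent = nn.Linear(latent_dim, ch * 16 * 16)
        self.up = nn.Upsample(scale_factor=8, mode="nearest")
        # First 3 Conv + Residual blocks: no color information yet
        self.early = nn.Sequential(
            *[nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), ResBlock(ch)) for _ in range(3)]
        )
        # Color hint (3 channels) + binary mask (1 channel) are concatenated here
        self.fuse = nn.Conv2d(ch + 4, ch, 3, padding=1)
        # Last 2 Conv + Residual blocks process features and color together
        self.late = nn.Sequential(
            *[nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), ResBlock(ch)) for _ in range(2)]
        )
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, z, hint, mask):
        # z: (B, 512), hint: (B, 3, 128, 128), mask: (B, 1, 128, 128)
        x = self.from_latent(z).view(-1, self.ch, 16, 16)
        x = self.up(x)                                    # (B, ch, 128, 128)
        x = self.early(x)
        x = self.fuse(torch.cat([x, hint, mask], dim=1))  # inject color mid-network
        x = self.late(x)
        return torch.tanh(self.to_rgb(x))

Moving the fusion layer earlier or later in this chain is exactly the kind of variation we would have liked to test.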

Secondly, we may need to adjust our hint simulation process. As introduced above, color hints were obtained with the PIL drawing library and a round brush. The brush randomly picks colors and locations from the input image and draws strokes on an empty canvas. Here is an example:

It may be too sparse for the model to learn from effectively. Since randomness exists during training, the issue can be more or less severe, but in general our results indicate that we may need to modify this method. Since users tend to put color toward the center of the canvas, we could make the simulated colors denser there. However, that would trade off against the flexibility we’re looking for. Alternatively, we could increase the brush size, which would cover more pixels while maintaining the spread. We would like to try this and see how it affects the results.

Finally, we may need to be more careful with our data. Although our data are all face-cropped anime images, the backgrounds are really noisy. Some images have very colorful backgrounds, and some contain more than one face. Worst of all, some of the images are comic strips. While the intended use of our model generally requires cleaner inputs, we acknowledge that this dataset might not be the best one to use, or that we may need to do more careful preprocessing before training.

Next steps

In future work, we’d like to solve the color-filling issue. We’d also like to work on another proposed version of our model. To further increase flexibility for users, we proposed a facial component-wise network, where input sketches are divided into 5 components and each of them has a corresponding Autoencoder. The intuition is simple: this way, users would have free control over different parts of the face. For example, a user could draw one eye open and the other closed. However, this is harder to train and requires more resources and time, so we decided to leave it out of this project. But we believe it would be an exciting improvement and fun to play with.

That’s all for this blog. Whether you are an anime lover like us or not, we hope you enjoyed the work! We welcome all suggestions and comments.

One can find our code here; we’ll release our panel soon, once we finalize everything.
