Article 1: Understanding the Convolution Function and CNNs

This article is part of the series “Convolutional Neural Networks A-Z”. I aim to expand this series into a complete guide to computer vision.

Kadam Parikh
13 min read · Feb 9, 2019

The step-by-step guide includes:

  1. Why convolution?
  2. What is convolution?
  3. How does convolution work?
  4. The simplest CNN to understand various CNN layers.
  5. Applications of convolution.

Why Convolution?

You might be thinking right now that I am making a mistake by writing “why” before “what”. Yes.. we usually learn what a thing is before we come to know why. But believe me, learning this way is always fun. First, understand the task at hand, then find a solution to it. And while finding a solution, we learn new things. This is a better approach to learning because, while understanding the problem, our mind gets curious about finding a way to solve it. So, let’s understand why we need convolution in computer vision.

As you don’t know what convolution is yet, let’s put that word aside. Simply try to understand the problem. The understanding starts with “how do humans visualize things?”. At this very moment, you are reading this article, right? And to be precise, you are reading this paragraph, and to be more precise, you are reading this lengthy sentence thinking that now I am going to say something like “to be more precise, you are reading this particular word and then this character…”

No.. I won’t say that. (Actually, I did)

What I am trying to tell you is, when you are reading a particular sentence, though your eyes are capturing the other lines in this article, your brain is focusing only on this line in order to interpret what it actually means. What about the surrounding sentences? Aren’t we actually seeing those? Yes.. we are, but it’s the focus of our brain which blurs out that useless stuff.

Meanwhile, our brain to the other parts of this article

This is how the human brain works. There are very few people in this world whose brain can focus on more than one thing at a time. Let me give you another example: you are in a movie theater watching your favourite movie. Now, while watching the movie, if I ask you “what is the person sitting in front of you and slightly to your left doing?”, can you answer this? Yes.. but you need to transfer your focus from ‘the movie’ to ‘that person’. That’s exactly what I am talking about.

Computer Vision is such a huge problem (so huge that researchers made a separate field for CV tasks) where we train models on multimedia inputs. Consider an even simpler task of object detection, where the model is only required to identify a person in the image. Now, to find the location of a person in the image, the machine can focus on the whole image at once. But if we are looking for an object like ‘a mouse’ in the image, the machine cannot find it by looking at the whole image. The image is large but the mouse is small; it would only be possible for the machine to locate the mouse if it focused on the part of the image where the mouse actually was. And hence, we need the machine to focus on the whole image part-by-part in order to find that small object. This is the same procedure a human follows when finding objects.. sometimes we don’t know about the objects around us until we actually focus on them.

Also, consider the case of face detection and recognition. Here, the computer needs to focus on the face rather than the person. So, the computer needs to look only at the part of the picture where the face is located. After focusing on the face, the computer needs to process it using some past knowledge that it learned while training (whether it has seen that face previously or not).

From the above reading, we can come to the conclusion that, to solve any computer vision problem at hand, we need to perform at least two tasks:

  1. Focus on different parts of the input; the size of the parts may vary at each phase, like focusing on the person in phase 1 and only on his/her face in phase 2.
  2. Process the part on which you are focusing right now.

What is Convolution?

Now that you know the way to solve any computer vision problem, you can easily understand what convolution is. Convolution is a function which performs the two tasks listed above. Hence, a convolution function focuses on an image in parts, processes each part and provides the output.

Before going further, you should know how digital images are represented in the form of matrices. It’s a totally different topic, so I suggest you go and learn it yourself. The most basic thing you need to learn is image and video representation in the form of arrays. You need to learn how each pixel is represented in an image/video frame. If you face problems learning it yourself, comment here and I will provide solutions.

Now, assuming you know how each pixel is represented, let’s continue understanding “what is convolution”. Convolution is a mathematical operation that takes two inputs: an image matrix and a filter (or kernel). The image matrix is nothing but the digital representation of image pixels, and the filter/kernel is another matrix which is used to process the image matrix. The size of the kernel is much smaller than the size of the image, which helps us process small parts of the image. Note that the size of the kernel is the same as the size of the part of the image which needs to be focused on. It will be easy to understand this with an example.

Consider a 5 x 5 image whose pixel values are either 0 or 1, and a filter matrix of size 3 x 3, as shown below.

The image and kernel matrices

The convolution then multiplies each 3 x 3 part of the image with the 3 x 3 filter matrix and provides the output, which is called a “feature map”, as shown below.

See how this processes the image by looking at parts of it

In the above picture, the yellow part denotes the focused part of the image, the green part remains unfocused, and the pink part represents the output, which is obtained by multiplying individual pixels of the focused part with the individual values of the kernel and then adding up the results. Note that the values of the kernel matrix are the small red numbers written in the bottom-right corner of each yellow cell.
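To make this concrete, below is a minimal NumPy sketch of the same operation. The pixel and kernel values are assumed (they are the classic 5 x 5 binary image and 3 x 3 kernel commonly used for this illustration, since the figure itself is not reproduced here):

```python
import numpy as np

# Assumed values: the classic 5 x 5 binary image and 3 x 3 kernel
# often used for this illustration.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

kh, kw = kernel.shape
out_h = image.shape[0] - kh + 1   # 3
out_w = image.shape[1] - kw + 1   # 3
feature_map = np.zeros((out_h, out_w), dtype=int)

# Slide the kernel over the image with stride 1: multiply the focused
# 3 x 3 window element-wise with the kernel and sum the products.
for i in range(out_h):
    for j in range(out_w):
        window = image[i:i + kh, j:j + kw]
        feature_map[i, j] = np.sum(window * kernel)

print(feature_map)  # [[4 3 4], [2 4 3], [2 3 4]] for these values
```

Running this prints the 3 x 3 feature map for these assumed values, one number per focused window.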

How does convolution work?

As you can see, each item in the feature map corresponds to a section of the image. The “yellow window” that moves over the image is called a kernel. Kernels are typically square, and 3 x 3 is a fairly common kernel size for small-ish images. However, you can increase the size of the kernel if you need to focus on larger parts of the image instead of small ones. The distance by which the window moves each time is called the stride. Here, the window moves by 1 position at a time (i.e. it shifts from left to right by 1 pixel at a time), hence stride = 1. Additionally, note that the size of the output image has reduced: the output obtained by filtering the input image is smaller. To obtain an output of the same size as the input, the input image is padded with zeros around the perimeter before performing the convolution, which dampens the values of the convolution around the edges of the image.

The size of the output image depends on the size of the input image and of the kernel, and on the values of stride and padding:

Output height = (H - K + 2P) / S + 1
Output width = (W - K + 2P) / S + 1

Here, H = height of the input image, W = width of the input image, K = size of the kernel, P = number of layers of zero padding and S = stride length.
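As a quick sanity check, here is a tiny helper (the name conv_output_size is mine, just for illustration) that computes the output size from this formula:

```python
def conv_output_size(h, w, k, p, s):
    """Output (height, width) of a convolution, using the formula above."""
    out_h = (h - k + 2 * p) // s + 1
    out_w = (w - k + 2 * p) // s + 1
    return out_h, out_w

# The 5 x 5 image and 3 x 3 kernel from the example, with no padding
# and stride 1, give a 3 x 3 feature map:
print(conv_output_size(5, 5, 3, 0, 1))  # (3, 3)
```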

The goal of a convolution function is filtering. As we move over an image, we check for patterns in that section of the image. You may be wondering how it can perform any kind of filtering by simply multiplying a kernel with some image, right? But experiments have proved that this really works. In fact, there are some predefined kernels which, when applied to an image, output a filtered image. Have a look:

Source: https://en.wikipedia.org/wiki/Kernel_(image_processing)

Exercise: Write code to perform convolution on an image which gives a sharpened image as output, using the kernel shown above. I have written one and observed the changes. The thing is, it really works!
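For reference, here is one possible solution sketch, assuming you have OpenCV installed; the file names below are placeholders:

```python
import cv2
import numpy as np

# The sharpening kernel shown on the Wikipedia page linked above.
sharpen_kernel = np.array([[ 0, -1,  0],
                           [-1,  5, -1],
                           [ 0, -1,  0]])

# "input.jpg" is a placeholder path; use any image you have.
img = cv2.imread("input.jpg")

# filter2D slides the kernel over every pixel; since this kernel is
# symmetric, correlation and true convolution give the same result.
sharpened = cv2.filter2D(img, -1, sharpen_kernel)  # -1 keeps the input depth

cv2.imwrite("sharpened.jpg", sharpened)
```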

Convolutional Neural Networks (CNNs)

A Convolutional Neural Network is a type of neural network which applies the convolution function to the input image many times, with different filter sizes, in a step-by-step manner. In short, it takes an input image, applies convolution to it, takes the output, reapplies convolution on that output, and so on.. The convolution functions applied at each stage differ by the size of the filter. There is no single predefined architecture of a CNN. Any neural network whose base depends on the convolution operation can be called a CNN.

In a CNN, the kernel matrix is actually a weight matrix W. While training on images, these weights change, and so when it is time to evaluate an image, these weights return high values if the network thinks it is seeing a pattern it has seen before. In short, the kernel matrix is not predefined; it is learnt as the network trains. The combinations of high weights from various filters let the network predict the content of an image. This is why, in CNN architecture diagrams, the convolution step is represented by a box, not by a rectangle/square; the third dimension represents the filters.

If you still don’t understand filters, let’s take the example of an RGB image. This image can be thought of as an image with 3 filters: ‘R’, ‘G’ and ‘B’. The value of a particular pixel of the ‘R’ filter tells the amount of red color in that pixel. This means, the more red in a pixel, the higher the value of the ‘R’ filter for that pixel. In other words, the ‘R’ filter identifies a pattern of “reddish-ness” in an image. Below is a simple CNN model:

LeNet Architecture

Note that the convolutional layers ‘conv1’ and ‘conv2’ contain a 3-D cube represented in sliced form. Here, each slice denotes one filter/kernel. Note that the size of all the filters in one convolutional layer is the same. Each filter learns one way of filtering the image. In short, a single convolutional layer can have more than one filter and hence learn more than one way of filtering the input image. Also, different convolutional layers can have different numbers of filters. It is a common convention that the number of filters increases as we go deeper into the network, as in the sketch below.
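Here is a minimal Keras sketch of such a stack. The filter counts (6 in conv1, 16 in conv2) follow LeNet-5, but the ReLU activations and max pooling are illustrative substitutions, not the original design:

```python
from tensorflow.keras import layers, models

# A LeNet-style stack: two convolutional layers, each followed by a
# pooling layer, with the number of filters growing with depth.
model = models.Sequential([
    layers.Conv2D(6, kernel_size=5, activation="relu",
                  input_shape=(32, 32, 1)),                # conv1: 6 filters
    layers.MaxPooling2D(pool_size=2),                      # pool1
    layers.Conv2D(16, kernel_size=5, activation="relu"),   # conv2: 16 filters
    layers.MaxPooling2D(pool_size=2),                      # pool2
    layers.Flatten(),
    layers.Dense(120, activation="relu"),
    layers.Dense(84, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.summary()  # prints each layer's output shape and parameter count
```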

Now, we also see another type of layer in the above network besides the convolutional layers: those with names ‘pool{x}’. These are known as pooling layers. Almost all classic CNNs include pooling layers. So, let’s understand how they help achieve the task.

Pooling Layers: Pooling works very much like convolution: we take a window and move it over the image. The only difference is the function applied to the image window, which (for max pooling) isn’t even linear.

Max pooling and average pooling are the most common pooling functions. Max pooling takes the largest value from the window of the image currently covered by the kernel, while average pooling takes the average of all values in the window. In convolution, for a given window, we multiplied the pixels of the image with the values of the kernel and then added them together to output a single value; in max/average pooling, instead of multiplying and adding, we take the max/average of the values in the window to output a single value.

Pooling function

Pooling looks the same as convolution, but there’s a difference: there is no filter/kernel applied to the image. It just takes the maximum or average from the red window. This doesn’t match what I said above, right? So think of it in a different manner: a pooling layer is like a kernel whose values are all 1. The reason why pooling layers don’t have learned kernels is that they aren’t used for filtering the image or detecting patterns in it. Their use is somewhat different. Let’s understand this..
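To make the mechanics concrete, here is a small NumPy sketch of both pooling functions (the helper name pool2d and the input values are mine, for illustration):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Max or average pooling over a 2-D array."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]])
print(pool2d(fmap))              # max pooling: [[6. 8.], [3. 4.]]
print(pool2d(fmap, mode="avg"))  # average pooling
```

Note how the 4 x 4 input shrinks to 2 x 2: this shrinking is exactly the “zooming out” discussed next.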

So, how does the pooling layer help? The pooling layer actually acts as a zoom-out layer. The output of the convolution layer, i.e. the filtered image of smaller size, is fed as input to the pooling layer. The pooling layer then zooms out of the image. You would ask: why? So here’s the answer..

Suppose you have a CNN with 5 convolution layers, all with the same kernel size. You might think that keeping the same kernel size at all the layers will only allow you to find objects in the image which fit in this kernel/window size. For example, with a kernel size of 3 x 3, all the layers would be able to find patterns for small objects such as a mouse but won’t be able to find large objects such as a person. But pooling layers are a great help here.

When the output of a pooling layer is given to the next convolutional layer, even though that convolutional layer keeps the same filter size as the previous one, new features are learned. This is because the input to the second convolution is a zoomed-out image. Hence, even if we keep the same kernel size for all the convolutional layers, the area of focus of these layers will be different. If the first convolution focuses just on some text, the second one will also focus on the objects around that text. This is all because of the pooling layer.

Working of Pooling layer

Let me give you one more example of the pooling layer. Suppose you are walking down a street and you see a banner attached high on a pole. To read the text this banner contains, you will focus exactly on the text. Now suppose you are shown only the text and not the surroundings (the banner); you will never know that this text is part of some banner. To detect the banner, you will zoom out your focus from the text to the banner. Now you can tell that there’s a banner. Suppose instead that you never focused on the text at first, i.e. you directly focused on the banner; then you won’t be able to tell what is actually written on it. So, this is the thing: you need to focus on the text as well as on the banner, but you can’t do both at the same time, right! If you focus on the banner, you lose your focus on the text. And if you focus on the text, you don’t even know there’s a banner which surrounds it.

Now, there are two ways you can do this. First, focus on the banner and then zoom in to focus on the text; second, focus on the text and then zoom out to get the banner in focus. In real life, you can go for either approach. But for computers, the second (zoom-out) approach is best suited. The reason is that, as you move ahead in a CNN, you modify the input to produce the output (you don’t save a copy of the input at every step for backup) and, as we already discussed, if you focus on the banner, the focus on the text is lost. For computers, there is no way to go back and get the original input, or even if there’s a way, it is time consuming.

Finally, for the pooling part, you now know:
1. Why is pooling important? To zoom out.
2. Why is zooming out important? To have a look at, and focus on, the surroundings or a larger part of the image.

I bluntly said here that pooling zooms out but how did I reach this conclusion? That’s something I will cover in the next article of this series. We are going to look at all these things with practical examples and code.

Applications of convolution

The convolution operation is used widely in all CNNs, which are used to solve tasks like:

  • Image processing
  • Image classification
  • Video filtering
  • Object detection
  • Object recognition
  • Object verification
  • Face detection
  • Face recognition
  • Image editing
  • Video editing
  • Image generation
  • Video generation and much more..

But let me be specific about some of the use cases that will motivate you more to learn various types of neural network architectures. You can make a neural network which can make any person smile using the “style transfer” technique, swap the faces of two people, and even use CNNs in NLP tasks, a field about teaching computers human language. That means CNNs are used not only in computer vision but also in Natural Language Processing.

Here, I gave a complete introduction to CNNs. You may still be confused about how CNNs really help! I said above that various experiments have proved that convolution works, but instead of just accepting it, be curious about how!! Wait for my next articles, where I will answer this question by actually showing you how. You will see what CNNs learn and how the input image is processed: which parts of the image it focuses on, which features are extracted from images, and much more.. I will be showing all of this with example code (in Keras).

Hope this article helps you.. it is my first article on Medium. Comment any queries, suggestions and even my mistakes to help me improve. Thank you..
