# Activation Functions in Neural Networks: What You May Not Know | by Thu Dinh | Jul, 2022

## How to identify the best (and worst) activation functions

When I first started learning ML, I was told that activations are used to mimic the “neuron activations” in the brain, hence the name neural networks. It was not until much later that I learned about the finer intricacies of this building block.

In this article, I will explain two key concepts and the intuition behind Activation functions in Deep Neural Networks:

- Why we need them,
- Why we can’t just pick any non-linear function as activation.

All figures and results in this article were created by the author: equations are written using TeXstudio; models are created using Keras and visualized with Netron; and graphs are plotted using matplotlib.

*Terminology*: An activation is a non-linear function. Simply put, it is any function not of the form `y = mx + b`.

In a Neural Network, these functions are usually put after the output of each Convolution or Fully-Connected (a.k.a. Dense) layer. Their main job is to “activate the neurons”, i.e. **capture the non-linearity** of the model.

But what is non-linearity, and why is it important?

Consider the following simple network with two Dense layers (if you have worked on the MNIST dataset, this may look familiar).

Let the input be `x`, the weight and bias of the first layer be `W_1, b_1`, and those of the second layer be `W_2, b_2`. With activation functions `σ_1, σ_2`, the output of the first layer is:

`z_1 = σ_1(W_1 x + b_1)`

And the output of the whole model is:

`y = σ_2(W_2 z_1 + b_2) = σ_2(W_2 σ_1(W_1 x + b_1) + b_2)`

But what if we did not use any activation function? Without `σ_1` and `σ_2`, the new output would be:

`y = W_2 (W_1 x + b_1) + b_2`

Notice that this equation can be simplified to:

`y = (W_2 W_1) x + (W_2 b_1 + b_2)`

which is equivalent to a shallow network with one Dense layer, with weight `W_2 W_1` and bias `W_2 b_1 + b_2`. Simply put, the second layer did not add any useful information at all; our model is now equivalent to a single Dense layer.

We can generalize this analysis to any number of layers, and the result still holds. (A similar result from Calculus: a composition of linear functions is still a linear function.)
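The collapse of stacked linear layers is easy to verify numerically. Below is a minimal sketch (not from the original article) using NumPy, with random weights standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Dense layers with no activations: y = W2 @ (W1 @ x + b1) + b2
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)

deep_output = W2 @ (W1 @ x + b1) + b2

# The equivalent single Dense layer: weight W2 @ W1, bias W2 @ b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
shallow_output = W @ x + b

# The two models agree for every input (up to floating-point error)
print(np.allclose(deep_output, shallow_output))  # True
```

The same algebra applies layer by layer, which is why any depth of activation-free Dense layers folds into one.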

So, for a Deep network to even make sense, we have to apply activations to each hidden output.

Otherwise, we will end up with a shallow network, and the learning capability will be severely limited.

A quick experiment on MNIST (training the same model with and without activations and comparing accuracies) further illustrates this result.

For most practical systems/models, the activation is one of these three: ReLU, Sigmoid, or Tanh (hyperbolic tangent).

## ReLU

The simplest (and arguably best) activation. If an output of a hidden layer is negative, we simply set it to zero:

`ReLU(x) = max(0, x)`

Its graph is zero for all negative inputs and the identity line for positive inputs.

**Pros:**

- Very cheap to compute: just a comparison against zero.
- Does not saturate for positive inputs, so gradients do not vanish there.
- Often converges faster than Sigmoid or Tanh in practice.

**Cons:**

- Dead ReLU: if a neuron's input is always negative, its output (and therefore its gradient) is always zero, so the neuron stops learning. This can be circumvented by better weight initialization.

## Sigmoid and Tanh

These functions have the form:

`Sigmoid(x) = 1 / (1 + e^{-x})`

`Tanh(x) = (e^x − e^{-x}) / (e^x + e^{-x})`

(Can you derive the derivatives of these functions? Don’t worry, modern ML frameworks like PyTorch and TensorFlow provide these activations for free, with back-propagation already built-in and optimized.)

Both have an S-shaped graph. They behave very similarly, except that the output is capped to [0, 1] for Sigmoid and to [-1, 1] for Tanh.
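As a quick sanity check on these definitions, here is a minimal NumPy sketch of all three activations (not from the original article):

```python
import numpy as np

def relu(x):
    # Zero out negative inputs, pass positive inputs through unchanged
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes any real input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))       # [0. 0. 2.]
print(sigmoid(0.0))  # 0.5, the midpoint of its range
print(np.tanh(x))    # values in (-1, 1), with tanh(0) = 0
```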

**Pros:**

- Bounded output, so activations cannot blow up.
- Sigmoid is good for capturing “probabilities”, as the output is capped between 0 and 1. (These per-neuron probabilities do not sum to 1; for a proper distribution over classes we need the Softmax activation.)
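The difference is easy to see numerically. A minimal sketch (assuming NumPy; the logit values are made up for illustration):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])

# Element-wise Sigmoid: each value lies in (0, 1), but the sum is unconstrained
sig = 1.0 / (1.0 + np.exp(-logits))
print(sig.sum())  # greater than 1 here: each output is an independent "probability"

# Softmax: exponentiate and normalize, so the outputs always sum to 1
soft = np.exp(logits) / np.exp(logits).sum()
print(soft.sum())  # 1.0
```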

**Cons:**

- Slower back-propagation computation.
- Slower convergence in many cases.
- Vanishing gradient: the graph is flat when the input is far from zero, so the gradients there are close to zero and learning stalls. This can be circumvented by regularization (keeping weights small so inputs stay near the steep region around zero).

When it comes to non-linearity, a few other functions come to mind: quadratic, square root, logarithm…

Why do we not use these functions in practice?

As a rule of thumb, a good activation:

- Is defined for all real numbers,
- Is differentiable, so that back-propagation can be implemented efficiently,
- Can be explained “heuristically”.

Some problems with the three functions above are:

- Quadratic: gives no meaningful signal, since inputs of -2 and 2 produce the same output.
- Square root: not defined for `x < 0`.
- Logarithm: not defined for `x <= 0`, and this function is also unbounded near zero.

As an exercise, consider other non-linear functions you have encountered in Calculus or Linear Algebra, and think about why we would not use them. This is a good way to build your intuition for Deep Learning.

For example: Can I extend the square root function to negative values (by drawing symmetrically) and use this as an activation function?

*(Hint: There is something very wrong with this activation function!)*
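One way to probe the hint numerically is with a finite-difference derivative. A sketch (`signed_sqrt` and `slope` are names introduced here, not standard functions):

```python
import numpy as np

def signed_sqrt(x):
    # Extend sqrt symmetrically to negative inputs: sign(x) * sqrt(|x|)
    return np.sign(x) * np.sqrt(np.abs(x))

def slope(f, x, h=1e-8):
    # Central finite-difference estimate of the derivative of f at x
    return (f(x + h) - f(x - h)) / (2 * h)

print(slope(signed_sqrt, 1.0))   # ~0.5: well-behaved away from zero
print(slope(signed_sqrt, 1e-6))  # ~500: the gradient blows up near zero
```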

The activation function is often an afterthought when building Deep Learning models. However, there is some subtlety to its mechanics that you should be aware of. Hopefully, this article gives you better insight into the fundamental idea of activations, and why we choose some functions over others.

Happy learning!