# Activation Functions in Neural Networks: What You May Not Know

## How to identify the best (and worst) activation functions

When I first started learning ML, I was told that activations are used to mimic the “neuron activations” in the brain, hence the name neural networks. It was not until much later that I learned about the finer intricacies of this building block.

In this article, I will explain two key concepts and the intuition behind Activation functions in Deep Neural Networks:

• Why we need them,
• Why we can’t just pick any non-linear function as activation.

All figures and results in this article were created by the author: equations are written using TeXstudio; models are created using Keras and visualized with Netron; and graphs are plotted using matplotlib.

Terminology: An activation is a non-linear function. Simply put, it is any function not of the form `y = mx + b`.

In a Neural Network, these functions are usually put after the output of each Convolution or Fully-Connected (a.k.a. Dense) layer. Their main job is to “activate the neurons”, i.e. capture the non-linearity of the model.

But what is non-linearity, and why is it important?

Consider the following simple network with two Dense layers (if you have worked on the MNIST dataset, this may look familiar):
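
A minimal Keras sketch of such a network could look like the following; the layer sizes (128 hidden units, 10 output classes) and the choice of activations are illustrative assumptions, not fixed by the discussion:

```python
# A minimal two-Dense-layer network in Keras.
# Layer sizes and activations are illustrative choices, assuming
# flattened 28x28 MNIST images as input.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),    # input x
    layers.Dense(128, activation="relu"),    # W_1, b_1 followed by σ_1
    layers.Dense(10, activation="softmax"),  # W_2, b_2 followed by σ_2
])
model.summary()
```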

Let the input be `x`, the weight and bias of the first layer be `W_1, b_1`, and of the second layer be `W_2, b_2`. With activation functions `σ_1, σ_2`, the output of the first layer is:

`h_1 = σ_1(W_1 x + b_1)`

And the output of the whole model is:

`y = σ_2(W_2 h_1 + b_2) = σ_2(W_2 σ_1(W_1 x + b_1) + b_2)`

But what if we did not use any activation function? Without `σ_1` and `σ_2`, the new output would be:

`y = W_2 (W_1 x + b_1) + b_2 = W_2 W_1 x + W_2 b_1 + b_2`

Notice that this equation can be simplified to:

`y = W x + b`, where `W = W_2 W_1` and `b = W_2 b_1 + b_2`,
which is equivalent to a shallow network with one Dense layer. Simply put, the second layer did not add any useful information at all. Our model is now equivalent to a single Dense layer mapping the input straight to the output.

We can generalize this analysis to any arbitrary number of layers, and the result will still hold. (A similar result from Calculus: Composition of linear functions is still a linear function.)
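
A quick numerical check of this collapse, using arbitrary shapes and random values chosen purely for illustration:

```python
# Numerical check: two Dense layers without activations collapse into one.
# Shapes and values are arbitrary, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                 # input
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)   # first layer
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)   # second layer

two_layers = W2 @ (W1 @ x + b1) + b2        # "deep" but purely linear model
W, b = W2 @ W1, W2 @ b1 + b2                # collapsed single layer
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True
```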

So, for a Deep network to even make sense, we have to apply activations to each hidden output.

Otherwise, we will end up with a shallow network, and the learning capability will be severely limited.

Here is a quick experiment on MNIST to further illustrate the result:
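
A minimal Keras sketch to run such a comparison yourself: the same two-layer network is trained with and without a hidden activation. The architecture and hyperparameters are illustrative assumptions, not necessarily the exact setup behind the author's experiment:

```python
# Train the same two-layer network on MNIST with and without a hidden activation.
# Architecture and hyperparameters are illustrative choices only.
from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def build(hidden_activation):
    return keras.Sequential([
        layers.Flatten(input_shape=(28, 28)),
        layers.Dense(128, activation=hidden_activation),  # None = purely linear
        layers.Dense(10, activation="softmax"),
    ])

for act in ["relu", None]:
    model = build(act)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=5, verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"hidden activation = {act}: test accuracy = {acc:.4f}")
```

Without the hidden activation, the model is mathematically equivalent to a single Dense layer (i.e. multinomial logistic regression), so its accuracy should lag behind the activated version.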

For most practical systems/models, the activation is one of these three: ReLU, Sigmoid, or Tanh (hyperbolic tangent).

## ReLU

The simplest (and arguably best) activation. If an output of a hidden layer is negative, we simply set it to zero:

`ReLU(x) = max(0, x)`

The graph of ReLU is flat at zero for all negative inputs and follows the identity line `y = x` for positive inputs.
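
A minimal NumPy sketch of ReLU and its gradient (the test values are arbitrary):

```python
# ReLU and its derivative in NumPy; test values are arbitrary.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 0 for negative inputs and 1 for positive inputs.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```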

Pros:

• Very cheap to compute: just a comparison with zero.
• The gradient does not saturate for positive inputs, which usually means faster convergence than Sigmoid or Tanh.

Cons:

• Dead ReLU: if a neuron's pre-activation is negative for every input, its gradient is zero and the neuron stops learning. This can be circumvented by better weight initialization.

## Sigmoid and Tanh

These functions have the form:

`Sigmoid(x) = 1 / (1 + e^(-x))`

`Tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))`

(Can you derive the derivatives of these functions? Don’t worry, modern ML frameworks like PyTorch and TensorFlow provide these activations for free, with back-propagation already built-in and optimized.)
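
For reference, both derivatives have short closed forms in terms of the functions themselves:

`Sigmoid'(x) = Sigmoid(x) * (1 - Sigmoid(x))`

`Tanh'(x) = 1 - Tanh(x)^2`

Both derivatives shrink toward zero as `x` moves away from zero, which is exactly the vanishing-gradient issue discussed below.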

They are both smooth, S-shaped curves and behave very similarly, except that the output is bounded in [0, 1] for Sigmoid and in [-1, 1] for Tanh.
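
A minimal matplotlib sketch to plot both curves (the plotting range is an arbitrary choice):

```python
# Plot Sigmoid and Tanh; the x-range is an arbitrary illustrative choice.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 400)
plt.plot(x, 1 / (1 + np.exp(-x)), label="Sigmoid")
plt.plot(x, np.tanh(x), label="Tanh")
plt.axhline(0, color="gray", linewidth=0.5)
plt.legend()
plt.title("Sigmoid is bounded in [0, 1], Tanh in [-1, 1]")
plt.show()
```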

Pros:

• Bounded output, so activations do not blow up.
• Sigmoid is good for capturing “probabilities”, as the output is capped between 0 and 1. (These probabilities do not sum to 1; we need the Softmax activation for that. See the sketch after the Cons list below.)

Cons:

• Slower back-propagation computation.
• Slower convergence in many cases.
• Vanishing gradient: the graph is flat when the input is far from zero, so the gradient there is nearly zero and learning stalls (see the sketch below). This can be mitigated by careful weight initialization and normalization.
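
A small NumPy sketch illustrating two of the points above: Sigmoid outputs do not sum to 1 (Softmax does), and the Sigmoid gradient becomes vanishingly small far from zero. The example values are arbitrary:

```python
# Arbitrary example values, for illustration only.
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

logits = np.array([2.0, 1.0, -1.0])

# Sigmoid gives per-class "probabilities" that do not sum to 1 ...
print(sigmoid(logits).sum())                           # ~1.88
# ... whereas Softmax normalizes them into a proper distribution.
print((np.exp(logits) / np.exp(logits).sum()).sum())   # 1.0

# Vanishing gradient: far from zero, the Sigmoid derivative is tiny.
for x in [0.0, 5.0, 10.0]:
    grad = sigmoid(x) * (1 - sigmoid(x))
    print(f"sigmoid'({x}) = {grad:.1e}")   # 2.5e-01, 6.6e-03, 4.5e-05
```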

When it comes to non-linearity, a few other functions come to mind: quadratic, square root, logarithm…

Why do we not use these functions in practice?

As a rule of thumb, a good activation:

• Is defined for all real numbers,
• Is differentiable, so back-propagation can be implemented efficiently,
• Can be explained “heuristically”.

Some problems with the three functions above are:

• Quadratic: gives no meaningful signal, since inputs of -2 and 2 produce exactly the same output.
• Square root: not defined for `x < 0`.
• Logarithm: not defined for `x <= 0`; it is also unbounded (it goes to negative infinity) as the input approaches zero.

As an exercise, consider other non-linear functions you have encountered in Calculus/Linear Algebra and think about why we would not use them. This is a good way to build your intuition for Deep Learning.

For example: Can I extend the square root function to negative values (by drawing symmetrically) and use this as an activation function?
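
As a starting point for that exercise, here is a small sketch of the symmetric extension (sometimes called a signed square root), so you can check it against the rule-of-thumb criteria above yourself:

```python
# Symmetric extension of the square root, for exploring the exercise above.
import numpy as np
import matplotlib.pyplot as plt

def signed_sqrt(x):
    # sqrt(x) for x >= 0, mirrored to -sqrt(-x) for x < 0
    return np.sign(x) * np.sqrt(np.abs(x))

x = np.linspace(-4, 4, 400)
plt.plot(x, signed_sqrt(x))
plt.title("Square root extended symmetrically to negative inputs")
plt.show()
# Hint: look at how steep the curve is around x = 0 and think about
# what that does to the gradient there.
```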