Activation Functions in Neural Networks: What You May Not Know

How to identify the best (and worst) activation functions


When I first started learning ML, I was told that activations are used to mimic the “neuron activations” in the brain, hence the name neural networks. It was not until much later that I learned about the finer intricacies of this building block.

In this article, I will explain two key concepts and the intuition behind Activation functions in Deep Neural Networks:

  • Why we need them,
  • Why we can’t just pick any non-linear function as activation.

All figures and results in this article were created by the author: equations were written using TeXstudio; models were created using Keras and visualized with Netron; and graphs were plotted using matplotlib.

Some Activations supported by Keras

Terminology: An activation is a non-linear function. Simply put, it is any function not of the form y = mx + b.

In a Neural Network, these functions are usually put after the output of each Convolution or Fully-Connected (a.k.a. Dense) layer. Their main job is to “activate the neurons”, i.e. capture the non-linearity of the model.

But what is non-linearity, and why is it important?

Consider the following simple network with two Dense layers (if you have worked on the MNIST dataset, this may look familiar):

A simple two-layer network

Let the input be x, the weight and bias of the first layer be W_1, b_1, and those of the second layer be W_2, b_2. With activation functions σ_1, σ_2, the output of the first layer is:

h = σ_1(W_1 x + b_1)

And the output of the whole model is:

y = σ_2(W_2 h + b_2) = σ_2(W_2 σ_1(W_1 x + b_1) + b_2)

But what if we did not use any activation function? Without σ_1 and σ_2, the new output would be:

y = W_2 (W_1 x + b_1) + b_2

Notice that this equation can be simplified to:

y = (W_2 W_1) x + (W_2 b_1 + b_2) = W x + b,   with W = W_2 W_1 and b = W_2 b_1 + b_2,

which is equivalent to a shallow network with one Dense layer. Simply put, the second layer did not add any useful information at all. Our model is now equivalent to:

A shallow one-layer network

We can generalize this analysis to an arbitrary number of layers, and the result still holds. (A similar result from Calculus: the composition of linear functions is still a linear function.)

So, for a Deep network to even make sense, we have to apply activations to each hidden output.

Otherwise, we will end up with a shallow network, and the learning capability will be severely limited.
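To see this collapse concretely, here is a small NumPy sketch (not from the original article; the layer sizes are arbitrary assumptions, and it uses the Keras-style row-vector convention x·W + b):

```python
# Two Dense layers with no activation collapse into a single Dense layer.
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(1, 784))                      # one flattened 28x28 input
W1, b1 = rng.normal(size=(784, 128)), rng.normal(size=(128,))
W2, b2 = rng.normal(size=(128, 10)), rng.normal(size=(10,))

# "Deep" model: two Dense layers, no activation in between.
deep = (x @ W1 + b1) @ W2 + b2

# Shallow model: one Dense layer with W = W1 W2 and b = b1 W2 + b2.
W, b = W1 @ W2, b1 @ W2 + b2
shallow = x @ W + b

print(np.allclose(deep, shallow))                  # True
```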

Here is a quick experiment on MNIST to further illustrate the result:

Some quick results on MNIST
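For reference, a minimal Keras sketch of such a comparison might look like the following. The layer sizes, optimizer, and number of epochs are assumptions for illustration, not necessarily the exact setup behind the results above:

```python
# Compare the same two-layer architecture with and without a hidden activation.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def build_model(activation):
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation=activation),  # hidden Dense layer
        tf.keras.layers.Dense(10, activation="softmax"),     # output class probabilities
    ])

for act in (None, "relu"):  # None = no hidden activation, i.e. an effectively shallow model
    model = build_model(act)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=5, verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"hidden activation = {act}: test accuracy = {acc:.4f}")
```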

For most practical systems/models, the activation is one of these three: ReLU, Sigmoid, or Tanh (hyperbolic tangent).

ReLU

The simplest (and arguably best) activation. If an output of a hidden layer is negative, we simply set it to zero:

ReLU(x) = max(0, x)

And the graph of ReLU is:

ReLU Activation function

Pros:

  • Very cheap to compute, and back-propagation through it is just as cheap: the gradient is either 0 or 1.
  • Does not saturate for positive inputs, so it largely avoids the vanishing-gradient problem and often converges faster than Sigmoid or Tanh.

Cons:

  • Dead ReLU: if a neuron’s pre-activation is negative for every input, its output and its gradient are both zero, so the neuron stops learning. This can often be mitigated by better weight initialization (see the short sketch below).
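To make the dead-ReLU point concrete, here is a tiny NumPy sketch (an illustration, not the article’s code):

```python
# ReLU and its gradient: negative inputs get zero output and zero gradient.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (the value at exactly 0 is a convention).
    return (x > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
# A neuron whose pre-activation is negative for every input always outputs 0
# and always receives a zero gradient, so it never recovers during training.
```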

Sigmoid and Tanh

These functions have the form:

Sigmoid(x) = 1 / (1 + e^(-x))

Tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

(Can you derive the derivatives of these functions? Don’t worry, modern ML frameworks like PyTorch and TensorFlow provide these activations for free, with back-propagation already built-in and optimized.)
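For reference, both derivatives can be written neatly in terms of the functions themselves (a standard result, stated here because the vanishing-gradient discussion below relies on it):

```latex
% Derivatives of the Sigmoid and Tanh activations.
\begin{align}
  \sigma'(x)           &= \sigma(x)\,\bigl(1 - \sigma(x)\bigr) \\
  \frac{d}{dx}\tanh(x) &= 1 - \tanh^{2}(x)
\end{align}
```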

And their graph is:

Sigmoid and Tanh Activation function

They behave very similarly, except that the output is squashed into [0, 1] for Sigmoid and into [-1, 1] for Tanh.

Pros:

  • The activations cannot blow up, since the output is bounded.
  • Sigmoid is good for capturing “probabilities”, as the output is capped between 0 and 1. (These probabilities do not sum to 1; we need the Softmax activation for that.)

Cons:

  • Slower back-propagation computation.
  • Slower convergence in many cases.
  • Vanishing gradient: the graph is flat when the input is far from zero, so almost no gradient flows back through saturated units (see the quick check below). This can be mitigated by using regularization.
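As a quick numeric check of the vanishing-gradient point (a sketch added for illustration, not taken from the article):

```python
# The gradients of Sigmoid and Tanh shrink rapidly away from zero.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}   sigmoid' = {sigmoid_grad(x):.2e}   tanh' = {tanh_grad(x):.2e}")
# At x = 10 the Sigmoid gradient is about 4.5e-05: almost no learning signal
# flows back through a saturated unit.
```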

When it comes to non-linearity, a few other functions come to mind: quadratic, square root, logarithm…

Why do we not use these functions in practice?

As a rule of thumb, a good activation:

  • Is defined for all real numbers,
  • Is differentiable (at least almost everywhere), so back-propagation can be implemented efficiently,
  • Can be explained “heuristically”.

Some problems with the three functions above are:

  • Quadratic: loses the sign of the signal. Inputs of -2 and 2 produce exactly the same output, so the network cannot tell them apart.
  • Square root: not defined for x < 0.
  • Logarithm: not defined for x <= 0, and it is unbounded (it goes to negative infinity) as the input approaches zero from the right.

As practice, you can consider other non-linear functions you have encountered in Calculus or Linear Algebra, and think about why we would not use them. This is a good exercise to improve your intuition for Deep Learning.

For example: Can I extend the square root function to negative values (by drawing symmetrically) and use this as an activation function?

Extended square root function

(Hint: There is something very wrong with this activation function!)

The activation function is often an afterthought when building Deep Learning models. However, there is some subtlety in its mechanics that you should be aware of. Hopefully, this article gives you a better insight into the fundamental idea of activations, and why we choose some functions over others.

Happy learning!
