How to identify the best (and worst) activation functions
When I first started learning ML, I was told that activations are used to mimic the “neuron activations” in the brain, hence the name neural networks. It was not until much later that I learned about the finer intricacies of this building block.
In this article, I will explain two key concepts and the intuition behind Activation functions in Deep Neural Networks:
- Why we need them,
- Why we can’t just pick any non-linear function as activation.
All figures and results in this article were created by the author: equations are written using TeXstudio; models are created using Keras and visualized with Netron; and graphs are plotted using matplotlib.
Terminology: An activation is a non-linear function. Simply put, it is any function not of the form
y = mx + b.
In a Neural Network, these functions are usually put after the output of each Convolution or Fully-Connected (a.k.a. Dense) layer. Their main job is to “activate the neurons”, i.e. capture the non-linearity of the model.
But what is non-linearity, and why is it important?
Consider the following simple network with two Dense layers (if you have worked on the MNIST dataset, this may look familiar):
Let the input be x, the weights and biases of the first layer be W_1, b_1, and of the second layer be W_2, b_2. With activation functions σ_1, σ_2, the output of the first layer is:

y_1 = σ_1(W_1 x + b_1)

And the output of the whole model is:

y = σ_2(W_2 y_1 + b_2) = σ_2(W_2 σ_1(W_1 x + b_1) + b_2)
But what if we did not use any activation function? Without σ_1 and σ_2, the new output would be:

y = W_2 (W_1 x + b_1) + b_2

Notice that this equation can be simplified to:

y = (W_2 W_1) x + (W_2 b_1 + b_2) = W x + b, where W = W_2 W_1 and b = W_2 b_1 + b_2,

which is equivalent to a shallow network with one Dense layer. Simply put, the second layer did not add any useful information at all. Our model is now equivalent to:
We can generalize this analysis to any number of layers, and the result still holds. (A familiar fact from Calculus: a composition of linear functions is itself a linear function.)
So, for a Deep network to even make sense, we have to apply activations to each hidden output.
Otherwise, we will end up with a shallow network, and the learning capability will be severely limited.
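The collapse argument above is easy to verify numerically. The sketch below (plain NumPy, with arbitrary made-up layer shapes) builds a two-layer network without activations and the equivalent single Dense layer, and checks that they produce identical outputs:

```python
import numpy as np

# A 2-layer network WITHOUT activations: y = W2 @ (W1 @ x + b1) + b2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)
x = rng.normal(size=8)

deep_output = W2 @ (W1 @ x + b1) + b2

# The equivalent single Dense layer: W = W2 W1, b = W2 b1 + b2
W = W2 @ W1
b = W2 @ b1 + b2
shallow_output = W @ x + b

print(np.allclose(deep_output, shallow_output))  # True
```

No matter how the weights are drawn, the two models agree on every input, which is exactly the point: without activations, depth adds parameters but no expressive power.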
Here is a quick experiment on MNIST to further illustrate the result:
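In the same spirit, here is a minimal stand-in experiment in plain NumPy (on the XOR toy problem rather than MNIST, so it needs no data download — the setup and hyperparameters are illustrative choices, not the author's original experiment). The same two-layer network is trained twice: once with a tanh activation on the hidden layer, once with no activation at all. No linear model can fit XOR, so the activation-free network plateaus at a high loss:

```python
import numpy as np

# XOR: a tiny dataset that no linear model can fit
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

def train(use_activation, steps=3000, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)
    W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
    for _ in range(steps):
        h_pre = X @ W1 + b1
        h = np.tanh(h_pre) if use_activation else h_pre
        pred = h @ W2 + b2
        err = pred - y                      # gradient of 0.5 * MSE w.r.t. pred
        gW2, gb2 = h.T @ err, err.sum(0)
        gh = err @ W2.T
        if use_activation:
            gh = gh * (1 - h ** 2)          # tanh'(x) = 1 - tanh(x)^2
        gW1, gb1 = X.T @ gh, gh.sum(0)
        W1 -= lr * gW1 / len(X); b1 -= lr * gb1 / len(X)
        W2 -= lr * gW2 / len(X); b2 -= lr * gb2 / len(X)
    return float(np.mean((pred - y) ** 2))

loss_with = train(use_activation=True)
loss_without = train(use_activation=False)
print(f"final MSE with tanh: {loss_with:.4f}, without activation: {loss_without:.4f}")
```

The linear variant cannot do better than predicting 0.5 everywhere (MSE 0.25), while the tanh variant drives the loss far lower — the same qualitative gap the MNIST experiment shows.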
For most practical systems/models, the activation is one of these three: ReLU, Sigmoid, or Tanh (hyperbolic tangent).
ReLU
The simplest (and arguably the best) activation. If an output of a hidden layer is negative, we simply set it to zero:

ReLU(x) = max(0, x)
And the graph of ReLU is:
- Dead ReLU: if a neuron's output is negative for every input, its gradient is zero and it stops learning entirely. This can be mitigated with better weight initialization (or with variants such as Leaky ReLU).
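A minimal NumPy sketch of ReLU and its gradient makes the dead-ReLU issue concrete: wherever the input is negative, the gradient is exactly zero, so no learning signal flows back through that neuron.

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative of ReLU: 1 where x > 0, 0 elsewhere
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # zero for all non-positive inputs, identity otherwise
print(relu_grad(x))  # zero gradient on the entire negative half-line
```

Note the derivative is a simple threshold, which is why ReLU back-propagation is so cheap compared to Sigmoid or Tanh.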
Sigmoid and Tanh
These functions have the form:

Sigmoid(x) = 1 / (1 + e^(-x))

Tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
(Can you derive the derivatives of these functions? Don’t worry, modern ML frameworks like PyTorch and TensorFlow provide these activations for free, with back-propagation already built-in and optimized.)
And their graph is:
They behave very similarly, except the output is capped within [0, 1] for Sigmoid and within [-1, 1] for Tanh. (In fact, Tanh(x) = 2·Sigmoid(2x) − 1, i.e. a scaled and shifted Sigmoid.)
- Bounded output: activations cannot blow up.
- Sigmoid is good for capturing "probabilities", as the output is capped between 0 and 1. (These probabilities do not sum up to 1; for that we need the Softmax activation.)
- Slower back-propagation: the exponentials are more expensive to compute than ReLU's simple threshold.
- Slower convergence in many cases.
- Vanishing gradient: the graph is flat when the input is far from zero, so the gradient there is close to zero. This can be mitigated with careful weight initialization and normalization techniques such as batch normalization.
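The vanishing-gradient point is easy to check numerically. The sketch below uses the standard identity Sigmoid'(x) = Sigmoid(x)·(1 − Sigmoid(x)); even a few units away from zero, the gradient is already tiny:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)); maximal (0.25) at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  gradient = {sigmoid_grad(x):.6f}")
```

In a deep stack of Sigmoid layers these small factors multiply together, which is what starves early layers of gradient signal.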
When it comes to non-linearity, a few other functions come to mind: quadratic, square root, logarithm…
Why do we not use these functions in practice?
As a rule of thumb, a good activation:
- is defined for all real numbers,
- is differentiable, so back-propagation can be implemented efficiently,
- can be explained "heuristically".
Some problems with the three functions above are:
- Quadratic: loses the sign of its input, so it gives no meaningful signal. Inputs of -2 and 2 produce the same output, and the network cannot tell them apart.
- Square root: not defined for x < 0.
- Logarithm: not defined for x <= 0, and it is also unbounded near zero (log x → −∞ as x → 0+).
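These failure modes are easy to see in code. A quick, purely illustrative NumPy check:

```python
import numpy as np

# Quadratic: the sign of the input is lost
print((-2.0) ** 2 == 2.0 ** 2)   # True: -2 and 2 are indistinguishable

# Square root: undefined for negative inputs
with np.errstate(invalid="ignore"):
    print(np.sqrt(-1.0))         # nan

# Logarithm: undefined at 0, and unbounded as the input approaches 0
with np.errstate(divide="ignore"):
    print(np.log(0.0))           # -inf
print(np.log(1e-300))            # already a huge negative number
```

A nan or -inf produced by one activation poisons every value downstream of it, so functions with restricted domains are non-starters in practice.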
As a practice, you can consider other non-linear functions you have encountered in Calculus/Linear Algebra. Think about why we would not use them. This is one good exercise to improve your intuition for Deep Learning.
For example: Can I extend the square root function to negative values (by reflecting the graph symmetrically through the origin) and use this as an activation function?
(Hint: There is something very wrong with this activation function!)
The activation function is often an afterthought when building Deep Learning models. However, there is some subtlety in its mechanics that you should be aware of. Hopefully, this article has given you better insight into the fundamental idea of activations, and why we choose some functions over others.