# Loss Functions and Their Use In Neural Networks | by Vishal Yathish | Aug, 2022

## Overview of loss functions and their implementations

Loss functions are one of the most important aspects of neural networks, as they (along with the optimization functions) are directly responsible for fitting the model to the given training data.

This article will dive into how loss functions are used in neural networks, different types of loss functions, writing custom loss functions in TensorFlow, and practical implementations of loss functions to process image and video training data — the primary data types used for computer vision, my topic of interest & focus.

First, a quick review of the fundamentals of neural networks and how they work.

**Neural networks** are a set of algorithms that are designed to recognize trends/relationships in a given set of training data. These algorithms are based on the way human neurons process information.

This equation represents how a neural network processes the input data at each layer and eventually produces a predicted output value.
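As a reference point, one common way to write this for a single layer is sketched below. The input vector *x*, the activation function *σ*, and the predicted output *ŷ* are assumed symbols introduced here for illustration; *w* and *b* are the weights and biases this article refers to.

$$\hat{y} = \sigma\left(w^{T}x + b\right)$$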

To **train** the model (the process by which it maps the relationship between the training data and the outputs), the neural network updates its parameters, the weights, *wT*, and biases, *b*, to satisfy the equation above.

Each training input is loaded into the neural network in a process called **forward propagation**. Once the model has produced an output, this predicted output is compared against the given target output, and the resulting error is passed backward through the network in a process called **backpropagation**; the parameters of the model are then adjusted so that it now outputs a result closer to the target output.

This is where loss functions come in.

A **loss function** is a function that **compares** the target and predicted output values; it measures how well the neural network models the training data. When training, we aim to minimize this loss between the predicted and target outputs.

The **parameters** are adjusted to minimize the average loss: we find the weights, *wT*, and biases, *b*, that minimize the value of *J* (the average loss).
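Written out, and assuming *m* training examples and a per-example loss *L* (both symbols are assumptions introduced here for illustration), the average loss takes the form:

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L\left(\hat{y}^{(i)}, y^{(i)}\right)$$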

We can think of this as akin to residuals in statistics, which measure the distance of the actual *y* values from the regression line (predicted values), with the goal being to minimize the net distance.
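To make the loop of forward propagation, loss, and parameter update concrete, here is a minimal sketch in plain Python rather than TensorFlow. It assumes a single linear neuron, a hand-made dataset, a learning rate of 0.05, and hand-derived gradients of the squared error; all of these are illustrative choices, not something a library prescribes.

```python
# Toy dataset generated by y = 2x + 1 (illustrative values, not from the article)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]


def forward(w, b, x):
    """Forward propagation for a single linear neuron."""
    return w * x + b


def average_loss(w, b):
    """J: the mean squared difference between predictions and targets."""
    return sum((forward(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(steps=2000, learning_rate=0.05):
    """Plain gradient descent on the average loss, starting from w = b = 0."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [forward(w, b, x) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b
        grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(2 * e for e in errors) / n
        # Step against the gradient to reduce the loss
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


initial_loss = average_loss(0.0, 0.0)
w, b = train()
final_loss = average_loss(w, b)
print(initial_loss, final_loss, w, b)
```

Starting from w = 0 and b = 0, the average loss begins at 41.0, and the updates move the parameters toward the values that generated the data (w = 2, b = 1).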

For this article, we will use Google’s **TensorFlow** library to implement different loss functions, since it makes it easy to demonstrate how loss functions are used in models.

In TensorFlow, the loss function the neural network uses is specified as a parameter in model.compile(), the method that configures the model for training before model.fit() is called.

`model.compile(loss='mse', optimizer='sgd')`

The loss function can be passed either as a string, as shown above, or as a function object, either imported from TensorFlow or written as a custom loss function, as we will discuss later.

```python
from tensorflow.keras.losses import mean_squared_error

model.compile(loss=mean_squared_error, optimizer='sgd')
```

All loss functions in TensorFlow have a similar structure:

```python
def loss_function(y_true, y_pred):
    losses = ...  # computed from y_true and y_pred
    return losses
```

It must be formatted this way because Keras calls the function passed as the loss argument of model.compile() with exactly two arguments: the target values, y_true, and the predictions, y_pred.

In supervised learning, there are two main types of loss functions, regression and classification loss functions, which correspond to the two major types of neural networks.

- Regression Loss Functions: used in regression neural networks, where, given an input value, the model predicts a corresponding output value (rather than one of a set of pre-selected labels). Examples: Mean Squared Error, Mean Absolute Error.
- Classification Loss Functions: used in classification neural networks, where, given an input, the neural network produces a vector of probabilities of the input belonging to various pre-set categories; the category with the highest probability can then be selected. Examples: Binary Cross-Entropy, Categorical Cross-Entropy.

## Mean Squared Error (MSE)

One of the most popular loss functions, MSE finds the average of the squared differences between the target and the predicted outputs.
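In symbols, assuming *m* training examples with targets *y* and predictions *ŷ* (notation chosen here for illustration), this reads:

$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^{2}$$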

This function has numerous properties that make it especially suited for calculating loss. The difference is squared, which means it does not matter whether the predicted value is above or below the target value; squaring also means that predictions with a large error are penalized disproportionately. MSE is also a convex function of the predicted values (as shown in the diagram above) with a clearly defined global minimum, which makes it easier to use **gradient descent optimization** to set the weight values.

Here is a standard implementation in TensorFlow — built into the TensorFlow library as well.

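```python
import tensorflow as tf

def mse(y_true, y_pred):
    # average the squared differences over the last (output) axis
    return tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)
```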

However, one disadvantage of this loss function is that it is very sensitive to outliers; if a predicted value is significantly greater than or less than its target value, this will significantly increase the loss.

## Mean Absolute Error (MAE)

MAE finds the average of the absolute differences between the target and the predicted outputs.

This loss function is used as an alternative to MSE in some cases. As mentioned previously, MSE is highly sensitive to outliers, which can dramatically affect the loss because the distance is squared. MAE is used in cases when the training data has a large number of outliers to mitigate this.
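The difference in outlier sensitivity can be seen with a small worked example. The sketch below uses plain Python and made-up numbers (the targets and both prediction lists are assumptions for illustration): every prediction is off by 0.5, and then one prediction is replaced with an error of 10.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 2.5, 3.5]          # every prediction is off by 0.5
with_outlier = [1.5, 2.5, 2.5, 14.0]  # the last prediction is off by 10

print(mse(y_true, clean), mae(y_true, clean))                # 0.25 0.5
print(mse(y_true, with_outlier), mae(y_true, with_outlier))  # 25.1875 2.875
```

A single bad prediction multiplies the MSE by roughly 100 here, while the MAE grows by a factor of less than 6.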

Here is a standard implementation in TensorFlow — built into the TensorFlow library as well.

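```python
import tensorflow as tf

def mae(y_true, y_pred):
    # average the absolute differences over the last (output) axis
    return tf.reduce_mean(tf.abs(y_true - y_pred), axis=-1)
```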

It also has some disadvantages: the gradient of the absolute error has the same magnitude no matter how small the error is, so gradient descent does not naturally take smaller steps as the average distance approaches 0, and the function’s derivative is undefined at exactly 0, where its graph has a sharp corner.

Because of this, a loss function called a **Huber Loss** was developed, which has the advantages of both MSE and MAE.

If the absolute difference between the actual and predicted value is less than or equal to a threshold value, 𝛿, then a squared error (as in MSE) is applied. Otherwise, if the error is sufficiently large, an absolute error (as in MAE) is applied.

This is the TensorFlow implementation; it uses a wrapper function to supply the threshold variable, which we will discuss in a little bit.

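```python
import tensorflow as tf

def huber_loss_with_threshold(t=1.0):
    # t is the threshold 𝛿; the default of 1.0 is an assumed value
    def huber_loss(y_true, y_pred):
        error = y_true - y_pred
        within_threshold = tf.abs(error) <= t
        small_error = tf.square(error) / 2
        large_error = t * (tf.abs(error) - (0.5 * t))
        # select element-wise; a Python `if` cannot branch on a tensor of booleans
        return tf.where(within_threshold, small_error, large_error)
    return huber_loss
```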
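To see how the two branches fit together, here is a per-example version in plain Python. It is a sketch that assumes the common form of the Huber loss, with a factor of 0.5 on the squared branch so that both branches agree at the threshold; the default threshold of 1.0 is an arbitrary choice.

```python
def huber(error, delta=1.0):
    """Huber loss for a single error value; delta is the threshold."""
    if abs(error) <= delta:
        # Small errors: quadratic, like MSE
        return 0.5 * error ** 2
    # Large errors: linear, like MAE, shifted so the curve stays continuous
    return delta * (abs(error) - 0.5 * delta)


print(huber(0.5))   # 0.125 (quadratic branch)
print(huber(1.0))   # 0.5   (exactly at the threshold)
print(huber(3.0))   # 2.5   (linear branch)
print(huber(-3.0))  # 2.5   (the sign of the error does not matter)
```

At an error of exactly 1.0, the linear branch would give 1.0 * (1.0 - 0.5) = 0.5 as well, which is why the loss has no jump at the threshold.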

## Binary Cross-Entropy/Log Loss

This is the loss function used in binary classification models — where the model takes in an input and has to classify it into one of two pre-set categories.

Classification neural networks work by outputting a vector of probabilities — the probability that the given input fits into each of the pre-set categories; then selecting the category with the highest probability as the final output.

In binary classification, there are only two possible actual values of y: 0 or 1. Thus, to accurately determine the loss between the actual and predicted values, the loss function needs to compare the actual value (0 or 1) with the probability that the input aligns with that category (*p(i)* = probability that the category is 1; 1 − *p(i)* = probability that the category is 0).

This is the TensorFlow implementation.

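```python
import tensorflow as tf

def log_loss(y_true, y_pred):
    # clip predictions away from 0 and 1 so that the logarithms stay finite
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1 - 1e-7)
    error = y_true * tf.math.log(y_pred + 1e-7) + (1 - y_true) * tf.math.log(1 - y_pred + 1e-7)
    return -error
```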
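A few concrete numbers show how this loss behaves. The sketch below is a plain-Python version for a single example, assuming natural logarithms and a clipping constant of 1e-7; the probabilities are made up for illustration.

```python
import math


def binary_cross_entropy(y_true, p, eps=1e-7):
    """Log loss for one example; p is the predicted probability of class 1."""
    # Clip p so that log(0) can never occur
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


confident_and_right = binary_cross_entropy(1, 0.9)  # about 0.105
confident_and_wrong = binary_cross_entropy(1, 0.1)  # about 2.303
completely_wrong = binary_cross_entropy(1, 0.0)     # finite thanks to clipping
print(confident_and_right, confident_and_wrong, completely_wrong)
```

The loss stays small when the predicted probability for the true class is high and grows quickly as that probability approaches 0.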

## Categorical Cross-Entropy Loss

In cases where the number of classes is greater than two, we utilize categorical cross-entropy — this follows a very similar process to binary cross-entropy.

Binary cross-entropy is a special case of categorical cross-entropy, where *M* = 2 — the number of categories is 2.
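The special-case relationship can be checked numerically. The sketch below assumes the usual definition of categorical cross-entropy, the negative sum over the *M* categories of the target value times the log of the predicted probability, with natural logarithms and made-up probabilities.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """y_true is a one-hot list; y_pred is a list of class probabilities."""
    return -sum(
        t * math.log(min(max(p, eps), 1 - eps))
        for t, p in zip(y_true, y_pred)
    )


def binary_cross_entropy(y_true, p, eps=1e-7):
    """Log loss for one example; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# Three categories: only the probability given to the correct class matters
three_class = categorical_cross_entropy([0, 1, 0], [0.1, 0.7, 0.2])

# Two categories (M = 2): the result matches binary cross-entropy
two_class = categorical_cross_entropy([0, 1], [0.2, 0.8])
binary = binary_cross_entropy(1, 0.8)
print(three_class, two_class, binary)
```

With a one-hot target, every term except the one for the correct category is multiplied by 0, so the loss reduces to the negative log of the probability assigned to the correct category.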

As seen earlier, when writing neural networks, you can import loss functions as function objects from the tf.keras.losses module. This module contains the following built-in loss functions:

However, there may be cases where these traditional/main loss functions may not be sufficient. Some examples would be if there is too much noise in your training data (outliers, erroneous attribute values, etc.) — which cannot be compensated for with data preprocessing — or use in unsupervised learning (as we will discuss later). In these instances, you can write custom loss functions to suit your specific conditions.

```python
def custom_loss_function(y_true, y_pred):
    losses = ...  # computed from y_true and y_pred
    return losses
```

Writing custom loss functions is very straightforward; the only requirement is that the loss function takes exactly two parameters: y_true (actual output) and y_pred (predicted output), in that order.

Some examples are the following 3 custom loss functions for a variational auto-encoder (VAE) model, from *Hands-On Image Generation with TensorFlow* by Soon Yau Cheong.

```python
def vae_kl_loss(y_true, y_pred):
    # vae is defined outside this excerpt
    kl_loss = -0.5 * tf.reduce_mean(1 + vae.logvar - tf.square(vae.mean) - tf.exp(vae.logvar))
    return kl_loss

def vae_rc_loss(y_true, y_pred):
    # rc_loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    rc_loss = tf.keras.losses.MSE(y_true, y_pred)
    return rc_loss

def vae_loss(y_true, y_pred):
    kl_loss = vae_kl_loss(y_true, y_pred)
    rc_loss = vae_rc_loss(y_true, y_pred)
    kl_weight_const = 0.01
    return kl_weight_const * kl_loss + rc_loss
```

Depending on the math of your loss function, you may need to add additional parameters — such as the threshold value 𝛿 in the Huber Loss (above); to do this, you must include a wrapper function, as TF will not allow you to have more than 2 parameters in your loss function.

```python
def custom_loss_with_threshold(threshold=1):
    def custom_loss(y_true, y_pred):
        pass  # Implement loss function - can use the threshold variable
    return custom_loss
```
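As a filled-in illustration of the wrapper pattern, here is a plain-Python sketch. The loss itself is hypothetical and invented for this example: it ignores absolute errors smaller than the threshold and averages whatever exceeds it. The function names are likewise assumptions.

```python
def absolute_loss_with_threshold(threshold=1.0):
    """Wrapper: captures the threshold and returns a two-parameter loss."""
    def loss(y_true, y_pred):
        # Errors below the threshold count as 0; larger errors count
        # only by the amount they exceed the threshold
        excess = [max(abs(t - p) - threshold, 0.0) for t, p in zip(y_true, y_pred)]
        return sum(excess) / len(excess)
    return loss


# Calling the wrapper fixes the threshold and yields a (y_true, y_pred) function
loss_fn = absolute_loss_with_threshold(threshold=0.5)
print(loss_fn([1.0, 2.0, 3.0], [1.2, 2.0, 5.0]))  # 0.5
```

The inner function still takes only y_true and y_pred, so the object returned by the wrapper, not the wrapper itself, is what would be handed to a framework as the loss.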

Let’s look at some practical implementations of loss functions. Specifically, we will look at how loss functions are used to process image data in various use cases.

## Image Classification

One of the most fundamental aspects of computer vision is image classification — being able to assign an image to one of two or more pre-selected labels; this allows users to recognize objects, writing, people, etc. within the image (in image classification, the image usually has only one subject).

The most commonly used loss function in image classification is cross-entropy loss/log loss (binary for classification between 2 classes; categorical, or sparse categorical when the labels are stored as integers, for 3 or more), where the model outputs a vector of probabilities that the input image belongs to each of the pre-set categories. This output is then compared to the actual output, represented by a vector of equal size, where the correct category has a probability of 1 and all others have a probability of 0.
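The comparison between the probability vector and the target vector can be sketched in plain Python. The five probabilities and the label below are made up for illustration, and the helper names are assumptions; the point is that an integer label and its one-hot vector lead to the same cross-entropy value.

```python
import math


def one_hot(label, num_classes):
    """Target vector with 1 at the correct category and 0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def predicted_class(probabilities):
    """Index of the category with the highest probability."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


probabilities = [0.05, 0.10, 0.70, 0.10, 0.05]  # model output for one image
label = 2                                       # correct category as an integer
target = one_hot(label, len(probabilities))

# Cross-entropy against the one-hot vector
loss_from_one_hot = -sum(t * math.log(p) for t, p in zip(target, probabilities))
# Cross-entropy computed directly from the integer label
loss_from_label = -math.log(probabilities[label])

print(predicted_class(probabilities), loss_from_one_hot, loss_from_label)
```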

A rudimentary implementation of this can be imported directly from the TensorFlow library and does not require any further customization or modification. Below is an excerpt of an open-source Deep Convolutional Neural Network (CNN) by IBM which classifies document images (id cards, application forms, etc.).

```python
model.add(Dense(5, activation='sigmoid'))
model.summary()

model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```

Research is currently being done to develop new (custom) loss functions to optimize multi-class classification. Below is an excerpt of a proposed loss function developed by researchers at Duke University, which extends categorical cross-entropy loss by looking at patterns in incorrect results as well, to speed up the learning process.

```python
import math
import numpy as np

# exponential and updateMatrix are helper functions not shown in this excerpt
def matrix_based_crossentropy(output, target, matrixA, from_logits=False):
    Loss = 0
    ColumnVector = np.matmul(matrixA, target)
    for i, y in enumerate(output):
        Loss -= (target[i] * math.log(output[i], 2))
        Loss += ColumnVector[i] * exponential(output[i])
        Loss -= (target[i] * exponential(output[i]))
    newMatrix = updateMatrix(matrixA, target, output, 4)
    return [Loss, newMatrix]
```

## Image Generation

Image generation is a process by which neural networks create images (from an existing library) per the user’s specifications.

Throughout this article, we have dealt primarily with the use of loss functions in supervised learning — where we have had clearly labeled inputs, *x*, and outputs, *y*, and the model was supposed to determine the relationship between these two variables.

Image generation is an application of unsupervised learning — where the model is required to analyze and find patterns in unlabelled input datasets. The basic principle of loss functions still holds; the goal of a loss function in unsupervised learning is to determine the difference between the input example and the hypothesis — the model’s approximation of the input example itself.

For example, this equation models how MSE would be implemented for unsupervised learning, where *h(x)* is the hypothesis function.
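One way to write this, assuming *m* input examples and reusing *J* for the average loss (the index notation is an assumption chosen here for illustration), is:

$$J = \frac{1}{m}\sum_{i=1}^{m}\left(h\left(x^{(i)}\right) - x^{(i)}\right)^{2}$$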

Below is an excerpt of a Contrastive Language-Image Pretraining (CLIP) diffusion model — which generates art (images) through text descriptions — along with a few image samples.

```python
if args.init_weight:
    result.append(F.mse_loss(z, z_orig) * args.init_weight / 2)

lossAll, img = ascend_txt()

if i % args.display_freq == 0:
    checkin(i, lossAll)

loss = sum(lossAll)
loss.backward()
```

Another example of the use of loss functions in image generation was shown above in our **Custom Loss Functions** section, in the case of a variational auto-encoder (VAE) model.

In this article, we covered 1) how loss functions work, 2) how they are employed within neural networks, 3) different types of loss functions to suit specific neural networks, 4) specific loss functions and their use cases, 5) writing custom loss functions, and 6) practical implementations of loss functions for image processing.