Jarque-Bera Test with Python | by John DeJesus | Aug, 2022

Having your data follow a normal distribution is great since it would allow you to take advantage of all the benefits a normal distribution provides. But how do we know our data is following a normal distribution? That is where the Jarque-Bera Test comes in. In this article, we will learn about the Jarque-Bera Test, apply it to our data in Python with SciPy, do some quick practice, and create the test statistic equation from scratch.

The Jarque-Bera Test is a test to determine if a set of data values follows the normal distribution based on the data’s skewness and kurtosis. The test statistic equation incorporating skewness and kurtosis is:

Jarque Bera Test Statistic Equation. Made by Author on https://latex.codecogs.com/eqneditor/editor.php
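
Written out in LaTeX, the standard form of the Jarque-Bera statistic, which matches the description of its terms in this article, is:

```latex
JB = \frac{n}{6}\left(S^{2} + \frac{(K - 3)^{2}}{4}\right)
```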

Where n = the number of values for the data. S is the sample skewness (how much the data leans away from the mean) as defined below:

Sample Skewness for Jarque Bera Test Statistic Equation. Made by Author on https://latex.codecogs.com/eqneditor/editor.php
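
In the same notation, the usual moment-based definition of sample skewness can be written as follows. The symbols x_i (the individual data values) and \bar{x} (the sample mean) are assumed here for illustration:

```latex
S = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{3}}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}\right)^{3/2}}
```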

K is the sample kurtosis (how thick the tails of the distribution are) as defined below:

Sample Kurtosis for Jarque Bera Test Statistic Equation. Made by Author on https://latex.codecogs.com/eqneditor/editor.php
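
The usual moment-based definition of sample kurtosis, again assuming x_i for the data values and \bar{x} for the sample mean, has a 4th-power numerator and a squared denominator:

```latex
K = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{4}}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}\right)^{2}}
```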

The test statistic result will always be greater than or equal to zero since:

  1. The sample skewness in the test statistic equation is always squared, meaning S² is always positive or zero.
  2. The sample kurtosis is always positive or zero since the numerator is raised to the 4th power and the denominator is squared.
  3. The difference between the sample kurtosis and 3 is squared, meaning this term of the test statistic equation is always positive or zero.
  4. The sum of two terms ≥ 0 will also be greater than or equal to zero.

We treat our data as consistent with a normal distribution if the test statistic is close to zero and the p-value is larger than our standard threshold of 0.05. The p-value relates to a null hypothesis that the data is following a normal distribution. If the test statistic is large and the p-value is less than 0.05, we reject that null hypothesis and conclude the data does not follow a normal distribution.
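
The decision rule above can be sketched as a small helper. The function name is_consistent_with_normal and the simulated sample are assumptions made for illustration; only jarque_bera comes from SciPy.

```python
import numpy as np
from scipy.stats import jarque_bera


def is_consistent_with_normal(values, alpha=0.05):
    """Return True when the Jarque-Bera test fails to reject normality."""
    statistic, p_value = jarque_bera(values)
    return p_value > alpha


# Simulated, strongly right-skewed data (not from the baby names data set).
rng = np.random.default_rng(0)
skewed = rng.exponential(size=5000)
print(is_consistent_with_normal(skewed))  # strongly skewed, so this prints False
```

A True result only means the test did not find evidence against normality at the chosen threshold.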

Now that we have the test statistic established, you can pull a data set with numeric features and apply the Jarque-Bera Test to them. For this article, I will use the Popular Baby Names data set from OpenData NYC.

Data Set Up and Exploration

First, we will import our necessary packages and load our data. Notice we are importing the jarque_bera function from SciPy. We then inspect the data with the DataFrame's info() method.
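
A minimal sketch of this setup step is below. A few made-up placeholder rows stand in for the real download so the snippet runs on its own, and the file name in the comment is hypothetical. Only the Year of Birth and Count columns mentioned in this article are included.

```python
import io

import pandas as pd
from scipy.stats import jarque_bera  # used later for the normality test

# Placeholder rows with made-up values, standing in for the real data set.
# With the actual download you would pass its path instead, for example
# pd.read_csv("Popular_Baby_Names.csv")  (hypothetical file name).
placeholder_csv = io.StringIO(
    "Year of Birth,Count\n"
    "2011,120\n"
    "2011,45\n"
    "2012,80\n"
    "2013,15\n"
)

baby_names = pd.read_csv(placeholder_csv)

# Prints column names, non-null counts, and dtypes.
baby_names.info()
```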

Results of baby_names.info() on Popular Baby Names Data

Looking at the info method results, we see a Count column we can apply the Jarque-Bera Test to. This column has the total number of babies with a given name. We also see we have no missing values in that column, so we don’t need to apply methods such as dropna() to remove null values. Considering there is a Year of Birth column, let’s check to see how many years we have available.
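
As a rough sketch, the check can be done with value_counts. The small DataFrame here uses made-up placeholder rows, not the real data set.

```python
import pandas as pd

# Made-up placeholder rows standing in for the real data set.
baby_names = pd.DataFrame(
    {
        "Year of Birth": [2011, 2011, 2012, 2013, 2013, 2013],
        "Count": [120, 45, 80, 15, 60, 33],
    }
)

# Number of rows (names) recorded for each year.
year_counts = baby_names["Year of Birth"].value_counts()
print(year_counts)

# Number of distinct years available.
number_of_years = baby_names["Year of Birth"].nunique()
print(number_of_years)
```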

Result of calling value_counts method on baby_names["Year of Birth"].

So we have 8 years’ worth of baby names. It is interesting to see that the earlier years had more total names than the later years. I also noticed that 2018 is skipped, and there is nothing on the data’s webpage about that. For the purposes of this article, though, that will not hurt anything.

The quick way to apply our normal distribution test is to apply the jarque_bera function from scipy.stats to our column of data as shown below.
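
A minimal sketch of that call, assuming the data lives in a DataFrame named baby_names with a Count column. The values are simulated placeholders, so the numbers printed will differ from the article's result.

```python
import numpy as np
import pandas as pd
from scipy.stats import jarque_bera

# Simulated, right-skewed placeholder counts (not the real data set).
rng = np.random.default_rng(0)
baby_names = pd.DataFrame(
    {"Count": np.round(rng.exponential(scale=30, size=1000)) + 10}
)

# jarque_bera returns the test statistic and the p-value.
statistic, p_value = jarque_bera(baby_names["Count"])
print(statistic, p_value)
```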

This will output a tuple with the test statistic on the left and the p-value on the right.

(533643.0597435309, 0.0)

The value on the left is the test statistic from our JB equation at the beginning of the article. The value on the right is the p-value from our null hypothesis that the data is following a normal distribution. It is calculated as 1 minus the CDF of the chi-squared distribution with two degrees of freedom, evaluated at the test statistic. CDF stands for cumulative distribution function, the function used to determine the probability of a range of possibilities in a distribution.
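
The p-value calculation described above can be sketched as follows. The sample is simulated for illustration, and the variable names are assumptions.

```python
import numpy as np
from scipy.stats import chi2, jarque_bera

# Simulated sample used only to illustrate the calculation.
rng = np.random.default_rng(42)
sample = rng.normal(size=500)

result = jarque_bera(sample)

# p-value = 1 - CDF of chi-squared (2 degrees of freedom) at the statistic.
manual_p_value = 1 - chi2.cdf(result.statistic, df=2)

print(result.pvalue, manual_p_value)
```

The two printed values should agree up to floating-point rounding.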

Since the test statistic is incredibly large and the p-value is less than 0.05, we reject the null hypothesis that the Count data comes from a normal distribution.

If you are using this data set and want further practice, I would suggest you apply the Jarque-Bera test to the Count data for all 8 years. You can do this individually or create a for-loop iterating through each year. See below when you are ready to inspect the answers.

Answers
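
One possible way to write the loop is sketched here using groupby. The data is simulated placeholder data with three made-up years, so the printed values will not match the real results.

```python
import numpy as np
import pandas as pd
from scipy.stats import jarque_bera

# Simulated placeholder data: 500 right-skewed counts for each of 3 years.
rng = np.random.default_rng(0)
years = [2011, 2012, 2013]
baby_names = pd.DataFrame(
    {
        "Year of Birth": np.repeat(years, 500),
        "Count": np.round(rng.exponential(scale=30, size=1500)) + 10,
    }
)

# Apply the Jarque-Bera test to the Count values of each year.
results = {}
for year, group in baby_names.groupby("Year of Birth"):
    statistic, p_value = jarque_bera(group["Count"])
    results[year] = (statistic, p_value)
    print(f"{year}: statistic={statistic:.2f}, p-value={p_value:.4f}")
```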

Jarque-Bera Count Results for each year

Similar to the results for the Count column on all years, the test statistics are very high while the p-values are less than 0.05. Therefore, the null hypothesis that the data from each year comes from a normal distribution is rejected.

Here we will attempt to create the Jarque-Bera Test Statistic Equation from scratch. This will be a good coding exercise for you since the formula is not too complicated in terms of mathematical notation. For simplicity let’s focus on using a Pandas data frame column as the input. After you make an attempt you can check out my version below.

Note mine is probably not optimal since I am outlining more steps. For a cleaner implementation of Jarque-Bera, check out the SciPy source code linked from the Jarque-Bera Test documentation. I would also advise you to check out the source code to see how the p-value for this test is calculated using the chi-squared CDF on the test statistic. As of the writing of this article, chi2 is imported from scipy.stats.
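
One possible from-scratch version is sketched below. It is not necessarily the author's original code; the function name jarque_bera_statistic is an assumption, and it uses the moment-based skewness and kurtosis (dividing by n) described earlier.

```python
import numpy as np
import pandas as pd
from scipy.stats import jarque_bera


def jarque_bera_statistic(column: pd.Series) -> float:
    """Compute the Jarque-Bera test statistic for a Pandas column."""
    values = column.to_numpy(dtype=float)
    n = values.size

    # Deviations of each value from the sample mean.
    deviations = values - values.mean()

    # Second central moment (variance with n in the denominator).
    second_moment = np.mean(deviations ** 2)

    # Sample skewness S and sample kurtosis K.
    skewness = np.mean(deviations ** 3) / second_moment ** 1.5
    kurtosis = np.mean(deviations ** 4) / second_moment ** 2

    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)


# Compare against SciPy on simulated placeholder data.
rng = np.random.default_rng(0)
counts = pd.Series(rng.exponential(scale=30, size=1000))
print(jarque_bera_statistic(counts), jarque_bera(counts).statistic)
```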

Given there are other methods to test if data is coming from a normal distribution, we need to see when it is best to use the Jarque-Bera Test. According to these sources, it is recommended that the sample be large, with at least 2000 data values. This is so the test statistic can be compared to the chi-squared distribution with 2 degrees of freedom. Anything smaller could lead to a misleading test statistic and p-value.
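
One way to explore the sample-size point yourself is a small simulation: draw many truly normal samples of a given size and see how often the test rejects at 0.05. The function name rejection_rate and its parameters are assumptions for illustration. If the chi-squared approximation held well, the rejection rate would be close to 0.05.

```python
import numpy as np
from scipy.stats import jarque_bera


def rejection_rate(sample_size, trials=500, alpha=0.05, seed=0):
    """Share of truly normal samples that the Jarque-Bera test rejects."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(trials):
        sample = rng.normal(size=sample_size)
        if jarque_bera(sample).pvalue < alpha:
            rejections += 1
    return rejections / trials


# Compare a few sample sizes, from small to large.
for n in (20, 200, 2000):
    print(n, rejection_rate(n))
```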

Thanks for reading! Again, having your data follow a normal distribution is great! It allows you to take advantage of a symmetrical distribution where the mean, median, and mode are all equal. You also get to rely on the Central Limit Theorem as you work with more samples in your data. You just need a test like the Jarque-Bera Test to help you determine if you are working with a normal distribution.

Until next time,

John DeJesus
