1 Introduction
In my previous post the question came up of how to check whether data follow a normal distribution. There are several ways to do this.
2 Loading the libraries
import pandas as pd
import numpy as np
import pylab
import scipy.stats as stats
import matplotlib.pyplot as plt
#For Chapter 4.1
from scipy.stats import shapiro
#For Chapter 4.2
from scipy.stats import normaltest
3 Visual Normality Checks
np.random.seed(1)
df = pd.DataFrame({
'Col_1': np.random.normal(0, 2, 30000),
'Col_2': np.random.normal(5, 3, 30000),
'Col_3': np.random.normal(-5, 5, 30000)
})
df.head()
3.1 Quantile-Quantile Plot
A popular plot for checking the distribution of a data sample is the quantile-quantile plot, Q-Q plot, or QQ plot for short. A perfect match for the distribution will be shown by a line of dots on a 45-degree angle from the bottom left of the plot to the top right. Often a line is drawn on the plot to help make this expectation clear. Deviations of the dots from the line indicate a deviation from the expected distribution.
stats.probplot(df['Col_1'], dist="norm", plot=pylab)
pylab.show()
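The visual impression from the Q-Q plot can also be backed up with numbers. The following is a minimal sketch, assuming a sample generated with the same parameters as Col_1: when `probplot` is called without the `plot` argument, it only returns the plotted points and the least-squares line through them. The variable names are my own choice for illustration.

```python
import numpy as np
import scipy.stats as stats

np.random.seed(1)
sample = np.random.normal(0, 2, 30000)  # same parameters as Col_1

# Without the plot argument nothing is drawn; probplot returns the
# points of the Q-Q plot and the fitted straight line.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(sample, dist="norm")

# r is the correlation between the ordered sample values and the
# theoretical quantiles; values very close to 1 mean the dots lie on the line.
print('slope=%.3f, intercept=%.3f, r=%.4f' % (slope, intercept, r))
```

For normally distributed data the slope roughly estimates the standard deviation and the intercept the mean of the sample, so the line only sits at exactly 45 degrees when the data are standardized.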
3.2 Histogram Plot
A simple and commonly used plot to quickly check the distribution of a sample of data is the histogram.
bins = np.linspace(-20, 20, 100)
plt.hist(df['Col_1'], bins, alpha=0.5, label='Col_1')
plt.hist(df['Col_2'], bins, alpha=0.5, label='Col_2')
plt.hist(df['Col_3'], bins, alpha=0.5, label='Col_3')
plt.legend(loc='upper right')
plt.show()
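A histogram is easier to judge when it is compared with the density of a fitted normal distribution. As a rough sketch (the sample mirrors Col_2, and the choice of 50 bins is an assumption of mine), the bin heights can be compared with the fitted curve like this:

```python
import numpy as np
import scipy.stats as stats

np.random.seed(1)
sample = np.random.normal(5, 3, 30000)  # same parameters as Col_2

# Estimate mean and standard deviation of a normal distribution from the data
mu, sigma = stats.norm.fit(sample)

# Histogram scaled to a density, so it is comparable to a pdf
density, edges = np.histogram(sample, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Height of the fitted normal curve at the centre of each bin
expected = stats.norm.pdf(centers, loc=mu, scale=sigma)

# Largest vertical gap between the histogram and the fitted curve
max_gap = np.max(np.abs(density - expected))
print('mu=%.3f, sigma=%.3f, largest gap=%.4f' % (mu, sigma, max_gap))
```

To see the comparison instead of printing it, the same values can be drawn with `plt.plot(centers, expected)` on top of `plt.hist(sample, bins=50, density=True)`.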
4 Statistical Normality Tests
A normal distribution can also be examined with statistical tests. Python’s SciPy library contains two of the best known methods.
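In the SciPy implementation of these tests, the null hypothesis H0 is that the sample was drawn from a normal distribution, and you can interpret the p value as follows.
- p <= alpha: reject H0, the sample does not look normal
- p > alpha: fail to reject H0, the sample is consistent with a normal distribution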
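The decision rule can be written as a small function. This is only a sketch: `interpret_p` is a hypothetical helper of mine, not part of SciPy, and the exponential sample is made up here to show the rejecting branch on data that are clearly not normal.

```python
import numpy as np
from scipy.stats import normaltest

def interpret_p(p, alpha=0.05):
    """Hypothetical helper: turn a p value into the decision listed above."""
    if p > alpha:
        return 'fail to reject H0 (sample is consistent with a normal distribution)'
    return 'reject H0 (sample does not look normal)'

np.random.seed(1)
skewed = np.random.exponential(scale=2.0, size=1000)  # strongly right-skewed data

stat, p = normaltest(skewed)
print('p=%.3g -> %s' % (p, interpret_p(p)))
```

Keep in mind that failing to reject H0 is not proof of normality; it only means the test found no evidence against it at the chosen alpha.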
4.1 Shapiro-Wilk Test
The Shapiro-Wilk test evaluates a data sample and tests the null hypothesis that the data were drawn from a Gaussian distribution.
shapiro(df['Col_1'])
stat, p = shapiro(df['Col_1'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
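One caveat for a sample of this size: the SciPy documentation notes that for more than 5000 values the Shapiro-Wilk test statistic is accurate, but the p value may not be. One possible workaround, sketched here under the assumption that a random subsample is acceptable for your use case, is to run the test on 5000 randomly drawn values:

```python
import numpy as np
from scipy.stats import shapiro

np.random.seed(1)
sample = np.random.normal(0, 2, 30000)  # same parameters as Col_1

# Draw 5000 values without replacement to stay within the range
# for which the p value is documented as reliable
subsample = np.random.choice(sample, size=5000, replace=False)

stat, p = shapiro(subsample)
print('Statistics=%.3f, p=%.3f' % (stat, p))
```

The result depends on which values end up in the subsample, so it is worth repeating the draw a few times or combining it with one of the other checks.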
4.2 D’Agostino’s K² Test
D’Agostino’s K² test calculates summary statistics from the data, namely kurtosis and skewness, to determine whether the data distribution departs from the normal distribution.
normaltest(df['Col_1'])
stat, p = normaltest(df['Col_1'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
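To make the role of skewness and kurtosis more concrete, the two ingredients can be inspected separately. The sketch below assumes a sample like Col_1 and uses SciPy's `skewtest` and `kurtosistest`; according to the SciPy documentation, the `normaltest` statistic is the sum of the squared z-scores of these two tests.

```python
import numpy as np
from scipy.stats import normaltest, skewtest, kurtosistest, skew, kurtosis

np.random.seed(1)
sample = np.random.normal(0, 2, 30000)  # same parameters as Col_1

# For a normal distribution both values should be close to 0
# (kurtosis() returns the excess kurtosis by default)
print('skewness=%.3f, excess kurtosis=%.3f' % (skew(sample), kurtosis(sample)))

# z-scores of the separate skewness and kurtosis tests
z_skew, _ = skewtest(sample)
z_kurt, _ = kurtosistest(sample)

# The K² statistic combines both z-scores
k2, p = normaltest(sample)
print('z_skew^2 + z_kurt^2 = %.3f, normaltest statistic = %.3f'
      % (z_skew**2 + z_kurt**2, k2))
```

If the test rejects H0, looking at the two z-scores separately shows whether asymmetry or the heaviness of the tails is the main reason.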
5 Conclusion
In this post I presented several ways to check whether data are normally distributed. You can do this using graphical representations or statistical tests. I would always recommend combining several of these methods rather than relying on a single one.