13 min read

Dealing with missing values

1 Introduction

In the real world, there is virtually no record that has no missing values. Dealing with missing values can be done differently. In the following several methods will be presented how to deal with them.

2 Loading the Libraries and the Data

import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

from sklearn.impute import KNNImputer

from sklearn.model_selection import train_test_split
import pickle as pk
import random
df = pd.DataFrame({'Name': ['Anton', 'Moni', np.NaN, 'Renate', 'Justus'],
                   'Age': [32,22,62,np.NaN,18],
                   'Salary': [np.NaN, np.NaN,4500,2500,3800],
                   'Job': ['Student', np.NaN, 'Manager', 'Teacher', 'Student']})
df

3 Checking for missing values

df.isnull().sum()

def mv_overview_func(df):
    '''
    Gives an overview of the total number and percentage of missing values in a data set
    
    Args:
        df (DataFrame): Dataframe to which the function is to be applied
        
    Returns:
        Overview of the total number and percentage of missing values
    '''
    # Total missing values
    mis_val = df.isnull().sum()
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    # Data Types
    data_types = df.dtypes
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent, data_types], axis=1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
    columns = {0 : 'Missing Values', 1 : '% of Total Values', 2 : 'Data Type'})
    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    # Print some summary information
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
            " columns that have missing values.")
    # Return the dataframe with missing information
    return mis_val_table_ren_columns
mv_overview_func(df)

4 Droping of Missing Values

df_drop = df.copy()
df_drop

All rows with minimum one NaN will be dropped:

df_drop.dropna()

All rows from the defined columns with a NaN will be dropped:

df_drop.dropna(subset=['Name', 'Age'])

5 Imputations

5.1 for NUMERIC Features

5.1.1 Replace np.NaN with specific values

df_replace_1 = df.copy()
df_replace_1

Missing values from only one column (here ‘Name’) are replaced:

df_replace_1['Name'] = df_replace_1['Name'].fillna(0)
df_replace_1

Missing values from the complete dataset will be replaced:

df_replace_1.fillna(0, inplace=True)
df_replace_1

Does that make sense? Probably not. So let’s look at imputations that follow a certain logic.

5.1.2 Replace np.NaN with MEAN

A popular metric for replacing missing values is the use of mean.

df_replace_2 = df.copy()
df_replace_2

df_replace_2['Age'].mean()

Here all missing values of the column ‘Age’ are replaced by their mean value.

df_replace_2['Age'] = df_replace_2['Age'].fillna(df_replace_2['Age'].mean())
df_replace_2

scikit-learn - SimpleImputer

Always keep in mind that you will need all the steps you take to prepare for model training to make predictions later.

What do I mean by that exactly? If the data set you have available for model training already has missing values, it is quite possible that future data sets for which predictions are to be made will also contain missing values. In order for the prediction model to work, these missing values must be replaced by metrics that were also used in the model training.

df_replace_3 = df.copy()
df_replace_3

imp_age_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

imp_age_mean.fit(df_replace_3[['Age']])
df_replace_3['Age'] = imp_age_mean.transform(df_replace_3[['Age']])
df_replace_3

In the steps shown before, I used the .fit and .transform functions separately. If it’s not about model training, you can also combine these two steps and save yourself another line of code.

df_replace_4 = df.copy()
df_replace_4

imp_age_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

df_replace_4['Age'] = imp_age_mean.fit_transform(df_replace_4[['Age']])
df_replace_4

This way you can see which value is behind imp_age_mean concretely:

 imp_age_mean.statistics_

5.1.3 Replace np.NaN with MEDIAN

Other metrics such as the median can also be used instead of missing values:

df_replace_5 = df.copy()
df_replace_5

imp_age_median = SimpleImputer(missing_values=np.nan, strategy='median')

df_replace_5['Age'] = imp_age_median.fit_transform(df_replace_5[['Age']])
df_replace_5

imp_age_median.statistics_

5.1.4 Replace np.NaN with most_frequent

For some variables, it makes sense to use the most frequently occurring value for NaNs instead of mean or median.

df_replace_6 = df.copy()
df_replace_6

imp_age_mfreq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

df_replace_6['Age'] = imp_age_mfreq.fit_transform(df_replace_6[['Age']])
df_replace_6

Since there is no value in the variable ‘Age’ that occurs twice or more often, the lowest value is automatically taken. The same would apply if there were two equally frequent values.

5.2 for CATEGORICAL Features

5.2.1 Replace np.NaN with most_frequent

The most_frequent function can be used for numeric variables as well as categorical variables.

df_replace_7 = df.copy()
df_replace_7

imp_job_mfreq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

df_replace_7['Job'] = imp_job_mfreq.fit_transform(df_replace_7[['Job']])
df_replace_7

Here we see that with a frequency of 2, the job ‘student’ is the most common, so this is used for the missing value here.

imp_job_mfreq.statistics_

But what happens if we just don’t have a most frequent value in a categorical column like in our example within the column ‘Name’?

df_replace_8 = df.copy()
df_replace_8

imp_name_mfreq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

df_replace_8['Name'] = imp_name_mfreq.fit_transform(df_replace_8[['Name']])
df_replace_8

Again, the principle that the lowest value is used applies. In our example, this is the name Anton, since it begins with A and thus comes before all other names in the alphabet.

5.2.2 Replace np.NaN with specific values

df_replace_9 = df.copy()
df_replace_9

However, we also have the option of using certain values:

imp_job_const = SimpleImputer(missing_values=np.nan, 
                              strategy='constant',
                              fill_value='others')

df_replace_9['Job'] = imp_job_const.fit_transform(df_replace_9[['Job']])
df_replace_9

5.3 for specific Values

Not only a certain kind of values like NaN values can be replaced, this is also possible with specific values.

5.3.1 single values

For the following example we take the last version of the last used data set, here df_replace_9:

df_replace_9

rep_job_const = SimpleImputer(missing_values='others', 
                              strategy='constant',
                              fill_value='not_in_scope')

df_replace_9['Job'] = rep_job_const.fit_transform(df_replace_9[['Job']])
df_replace_9

As we can see, ‘others’ became ‘not_in_scope’.

5.3.2 multiple values

Unfortunately, we cannot work with lists for multiple values. But with the use of the pipeline function it works. We use for our following example again the last state of the dataset ‘df_replace_9’:

rep_pipe = Pipeline([('si1',SimpleImputer(missing_values = 'Manager', 
                                          strategy='constant',
                                          fill_value='not_relevant')),
                     ('si2', SimpleImputer(missing_values = 'Teacher', 
                                           strategy='constant', 
                                           fill_value='not_relevant'))])

df_replace_9['Job'] = rep_pipe.fit_transform(df_replace_9[['Job']])
df_replace_9

In my opinion, however, this approach has two disadvantages. Firstly, the values used are not saved (so cannot be reused automatically) and secondly, this is a lot of code to write. With an if-else function you would be faster:

def rep_func(col):

    if col == 'Student':
        return 'useless'
    if col == 'not_relevant':
        return 'useless'
    else:
        return 'useless'

df_replace_9['Job'] = df_replace_9['Job'].apply(rep_func)
df_replace_9

But again, the values used cannot be easily reused. You would have to create your own dictionary.

6 Further Imputation Methods

In the following chapter I would like to present some more imputation methods. They differ from the previous ones because they use different values instead of NaN values and not specific values as before.

In some cases, this can lead to getting a little closer to the truth and thus improve the model training.

6.1 with ffill

Here, the missing value is replaced by the preceding non-missing value.

df_replace_10 = df.copy()
df_replace_10

df_replace_10['Age'] = df_replace_10['Age'].fillna(method='ffill')
df_replace_10

The ffill function also works for categorical variables.

df_replace_10['Job'] = df_replace_10['Job'].fillna(method='ffill')
df_replace_10

The condition here is that there is a first value. Let’s have a look at the column ‘Salary’. Here we have two missing values right at the beginning. Here ffill does not work:

df_replace_10['Salary'] = df_replace_10['Salary'].fillna(method='ffill')
df_replace_10

6.2 with backfill

What does not work with ffill works with backfill. Backfill replaces the missing value with the upcoming non-missing value.

df_replace_10['Salary'] = df_replace_10['Salary'].fillna(method='backfill')
df_replace_10

6.3 Note on this Chapter

Due to the fact that different values are used for the missing values, it is not possible to define a uniform metric that should replace future missing values from the column.

However, one can proceed as follows for a model training. You start by using the functions ffill or backfill and then calculate a desired metric of your choice (e.g. mean) and save it for future missing values from the respective column.

I will explain the just described procedure in chapter 8.2 in more detail.

7 KNNImputer

Here, the KNN algorithm is used to replace missing values. If you want to know how KNN works exactly, check out this post of mine: Introduction to KNN Classifier. For this chapter, I have again created a sample data set:

df_knn = pd.DataFrame({'Name': ['Anton', 'Moni', 'Sven', 'Renate', 'Justus', 
                                'Sarah', 'Jon', 'Alex', 'Jenny', 'Jo'],
                       'Age': [32,22,62,np.NaN,18,63,np.NaN,44,23,71],
                       'Salary': [4000, np.NaN,4500,2500,3800,5500,7000,np.NaN,4800,3700]})
df_knn

7.1 on single columns

df_knn_1 = df_knn.copy()


imp_age_knn = KNNImputer(n_neighbors=2)

df_knn_1['Age'] = imp_age_knn.fit_transform(df_knn_1[['Age']])
df_knn_1

imp_salary_knn = KNNImputer(n_neighbors=2)

df_knn_1['Salary'] = imp_salary_knn.fit_transform(df_knn_1[['Salary']])
df_knn_1

So far so good. However, this is not how the KNNImputer is used in practice. I will show you why in the following chapter.

7.2 on multiple columns

df_knn_2 = df_knn.copy()
df_knn_2

imp_age_salary_knn = KNNImputer(n_neighbors=2)

df_knn_2[['Age', 'Salary']] = imp_age_salary_knn.fit_transform(df_knn_2[['Age', 'Salary']])
df_knn_2

As you can see from the comparison below, the two methods use different values from the KNNImputer (see index 1,3 and 7).

print()
print('df_knn_1')
print()
print(df_knn_1)
print('-------------------------')
print()
print('df_knn_2')
print()
print(df_knn_2)

It should be mentioned here that sometimes it is better to choose this statistical approach in order to achieve better results in the later model training.

7.3 Note on this Chapter

KNNImputer stores the calculated metrics for each column added to it. The number and the order of the columns must remain the same!

8 Imputation in Practice

As already announced, I would like to show again how I use the replacement of missing values in practice during model training. It should be noted that I use a simple illustrative example below. In practice, there would most likely be additional steps like using encoders or feature scaling. This will be omitted at this point. But I will show in which case which order should be followed.

In the following, I use a modified version of the created dataset from the KNN example:

df_practice = df_knn.copy()
Target_Var = [0,0,1,0,1,1,0,1,1,0]
df_practice['Target_Var'] = Target_Var

df_practice

8.1 SimpleImputer in Practice

8.1.1 Train-Test Split

df_practice_simpl_imp = df_practice.copy()
df_practice_simpl_imp

In model training, I first divide the data set into a training part and a test part.

x = df_practice_simpl_imp.drop('Target_Var', axis=1)
y = df_practice_simpl_imp['Target_Var']

trainX, testX, trainY, testY = train_test_split(x, y, test_size = 0.2)
print()
print('trainX')
print()
print(trainX)
print('-------------------------')
print()
print('testX')
print()
print(testX)

8.1.2 Fit&Transform (trainX)

Then I check if there are any Missing Values in trainX.

trainX.isnull().sum()

As we can see, we need to replace missing values in the columns ‘Age’ and ‘Salary’. For this I use the SimpleImputer with the strategy=‘mean’.

# Fit and Transform trainX column 'Age' with strategy='mean'
imp_age_mean1 = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_age_mean1.fit(trainX[['Age']])
trainX['Age'] = imp_age_mean1.transform(trainX[['Age']])

# Fit and Transform trainX column 'Salary' with strategy='mean'
imp_salary_mean1 = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_salary_mean1.fit(trainX[['Salary']])
trainX['Salary'] = imp_salary_mean1.transform(trainX[['Salary']])

print(trainX)
print()
print('Number of missing values:')
print(trainX.isnull().sum())

When I use the SimpleImputer, I save it separately to be able to use it again later.

pk.dump(imp_age_mean1, open('imp_age_mean1.pkl', 'wb'))
pk.dump(imp_salary_mean1, open('imp_salary_mean1.pkl', 'wb'))

8.1.3 Model Training

I won’t do the model training at this point, because that would still require me to either remove the categorical variable ‘name’ or convert it to a numeric one. This would only cause confusion at this point. Let’s assume we have done the model training like this.

dt = DecisionTreeClassifier()
dt.fit(trainX, trainY)

The execution of the prediction function (apart from the still existing categorical variable) would not work like this.

y_pred = df.predict(testX)

Because we also have Missing Values in the testX part.

testX.isnull().sum()

In order to be able to test the created model, we also need to replace them. For this purpose, we saved the metrics of the two SimpleImputers used in the previous step and can use them again here.

8.1.4 Transform (testX)

In the following, I will show the syntax to replace missing values for both columns (‘Age’ and ‘Salary’). I am aware that in this example only the ‘Age’ column contains a missing value. But in practice the data sets are usually larger than 10 observations.

# Transform testX column 'Age'
testX['Age'] = imp_age_mean1.transform(testX[['Age']])

# Transform testX column 'Salary'
testX['Salary'] = imp_salary_mean1.transform(testX[['Salary']])

print(testX)
print()
print('Number of missing values:')
print(testX.isnull().sum())

Now the test of our model training would also work.

8.2 ffill & backfill in Practice

As already noted in chapter 6.3, we cannot directly save a metric for further use with the ffill or backfill method. Therefore, in this part I show how I proceed in such a situation.

Here I will not go into each step individually, as they have been sufficiently explained in the previous chapter.

8.2.1 Train-Test Split

df_practice_ffill_bfill = df_practice.copy()
df_practice_ffill_bfill

x = df_practice_ffill_bfill.drop('Target_Var', axis=1)
y = df_practice_ffill_bfill['Target_Var']

trainX, testX, trainY, testY = train_test_split(x, y, test_size = 0.2)
print()
print('trainX')
print()
print(trainX)
print('-------------------------')
print()
print('testX')
print()
print(testX)

8.2.2 Use ffill or backfill

Now I will replace the missing values in our example with the ffill method.

# ffill column 'Age'
trainX['Age'] = trainX['Age'].fillna(method='ffill')
# ffill column 'Salary'
trainX['Salary'] = trainX['Salary'].fillna(method='ffill')

print(trainX)
print()
print('Number of missing values:')
print(trainX.isnull().sum())

8.2.3 Fit (trainX)

# Fit trainX column 'Age' with strategy='mean'
imp_age_mean2 = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_age_mean2.fit(trainX[['Age']])

# Fit trainX column 'Salary' with strategy='mean'
imp_salary_mean2 = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_salary_mean2.fit(trainX[['Salary']])
pk.dump(imp_age_mean, open('imp_age_mean2.pkl', 'wb'))
pk.dump(imp_salary_mean, open('imp_salary_mean2.pkl', 'wb'))

8.2.4 Transform (testX)

I’ll leave out the part about model training at this point, since it would only be fictitiously presented anyway.

testX.isnull().sum()

# Transform testX column 'Age'
testX['Age'] = imp_age_mean2.transform(testX[['Age']])

# Transform testX column 'Salary'
testX['Salary'] = imp_salary_mean2.transform(testX[['Salary']])

print(testX)
print()
print('Number of missing values:')
print(testX.isnull().sum())

8.3 KNNImputer in Practice

Now we come to the last method described in this post for replacing missing values in practice.

8.3.1 Train-Test Split

df_practice_knn = df_practice.copy()
df_practice_knn

x = df_practice_knn.drop('Target_Var', axis=1)
y = df_practice_knn['Target_Var']

trainX, testX, trainY, testY = train_test_split(x, y, test_size = 0.2)
print()
print('trainX')
print()
print(trainX)
print('-------------------------')
print()
print('testX')
print()
print(testX)

8.3.2 Fit&Transform (trainX)

# Fit and Transform trainX column 'Age' and 'Salary'
imp_age_salary_knn1 = KNNImputer(n_neighbors=2)
trainX[['Age', 'Salary']] = imp_age_salary_knn1.fit_transform(trainX[['Age', 'Salary']])

print(trainX)
print()
print('Number of missing values:')
print(trainX.isnull().sum())

pk.dump(imp_age_salary_knn1, open('imp_age_salary_knn1.pkl', 'wb'))

8.3.3 Transform (testX)

I’ll leave out again the part about model training at this point, since it would only be fictitiously presented anyway.

testX.isnull().sum()

# Transform testX column 'Age' and 'Salary'
testX[['Age', 'Salary']] = imp_age_salary_knn1.transform(testX[['Age', 'Salary']])

print(testX)
print()
print('Number of missing values:')
print(testX.isnull().sum())

8.4 Recommendation

What works for me in practice is a method from statistics that can be used when the data of a variable is normally distributed.

We know from normal distributions (Bell Curve) that 68% of the data lie between Z-1 and Z1.

Source: SimplyPsychology

That is, they have a mean value of 0 +- 1sd with a standard normal distribution.

Source: SimplyPsychology

Therefore, for a variable with a normal distribution, we can replace the missing values with random values that have a range from mean - 1sd to mean + 1sd.

This method, is a little more cumbersome than the functions we used before, but it provides slightly more accurate values.

df_recom = df_knn.copy()
df_recom = df_recom[['Name', 'Age']]
df_recom

mean_age = df_recom['Age'].mean()
sd_age = df_recom['Age'].std()

print('Mean of columne "Age": ' + str(mean_age))
print('Standard deviation of columne "Age": ' + str(sd_age))

Fast saving of the values:

pk.dump(mean_age, open('mean_age.pkl', 'wb'))
pk.dump(sd_age, open('sd_age.pkl', 'wb'))

With the random.uniform function I can output floats for a certain range.

random.uniform(mean_age-sd_age, 
               mean_age+sd_age)

print('Lower limit of the range: ' + str(mean_age-sd_age))
print('Upper limit of the range: ' + str(mean_age+sd_age))

def fill_missings_gaussian_func(col, mean, sd):
    if np.isnan(col) == True: 
        col = random.uniform(mean-sd, mean+sd)
    else:
         col = col
    return col
df_recom['Age'] = df_recom['Age'].apply(fill_missings_gaussian_func, args=(mean_age, sd_age)) 
df_recom

Voilá, we have inserted different new values for the missing values of the column ‘Age’, which are between the defined upper and lower limit of the rage. Now, if you want to be very precise, you can round the ‘Age’ column to whole numbers to make it consistent.

9 Conclusion

In this post, I have shown different methods of replacing missing values in a dataset in a useful way. Furthermore, I have shown how these procedures should be applied in practice during a model training.