
Data Wrangling

1 Introduction

Never stop learning!

Entering the field of data science with “R / R-Studio” was a smart move. Now it is advisable for every data scientist to learn another scripting language as well.

Let’s start with Python!

For this post the dataset flight from the statistics platform “Kaggle” was used. You can download it from my GitHub Repository.

2 Loading the libraries and the data

import pandas as pd
import numpy as np
flight = pd.read_csv("flight.csv")

3 Overview of the data

The following commands give a quick overview of the available data.

flight.head()

flight.tail()

flight.shape

flight.columns

flight['Origin_Airport'].value_counts().head()

flight.dtypes.head()
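
A compact alternative that combines several of these checks is the info() method, which prints the number of rows, the column dtypes, and the non-null counts in a single call:

# Shape, dtypes and non-null counts in one overview
flight.info()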

We can also output the unique values of a column.

# List unique values in the flight['Origin_Airport'] column
flight.Origin_Airport.unique()

Now let’s take a look at the unique values of several columns. For this purpose, I will select 4 categorical columns from the dataset as an example:

flight_subset = flight[['UniqueCarrier','Tai_lNum','Origin_Airport','Origin_City_Name']]
flight_subset

Now I can use a for loop to display the number of labels each column contains:

for feature in flight_subset.columns:
    print(feature, ':', len(flight_subset[feature].unique()), 'labels')
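
pandas also offers a built-in shortcut for this loop: nunique() returns the number of distinct values per column.

# Number of unique labels per column in one call
flight_subset.nunique()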

4 Get some statistics

flight.describe()

flight['Dep_Delay'].agg(['mean', 'median', 'std', 'min', 'max']).reset_index()

flight[['Origin_Airport', 'Departure_Time', 'Dep_Delay', 'Carrier_Delay']].groupby('Origin_Airport').mean().head()
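
If you need several statistics per group at once, groupby() can be combined with agg(). A minimal sketch, using columns from the flight dataset:

# Several delay statistics per origin airport in one call
flight.groupby('Origin_Airport')['Dep_Delay'].agg(['mean', 'median', 'std']).head()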

5 Select data

5.1 Easy Selection

flight[['Year']].head()

flight[['Year', 'WeatherDelay']].head()

# Select specific rows
flight[1:4]

# Select specific rows & columns
flight.loc[1:4, ['Year', 'WeatherDelay', 'Flight_Date']]

# Select all columns from Col_X to Col_Y
flight.loc[:,'Year':'DayOfWeek'].head()

# Select all columns from Col_X to Col_Y and Col_Z
flight.loc[:,'Year':'DayOfWeek'].join(flight.loc[:,'Tai_lNum']).head()

# Select all columns from Col_X to Col_Y and from Col_Z to Col_*
flight.loc[:,'Year':'DayOfWeek'].join(flight.loc[:,'Tai_lNum':'Origin_Airport']).head()
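
Besides label-based selection with loc, pandas offers position-based selection with iloc. A short sketch:

# Select the first 5 rows and the first 3 columns by position
flight.iloc[0:5, 0:3]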

5.2 Conditional Selection

flight[(flight["Distance"] >= 3000) & (flight["DayOfWeek"] == 1) & (flight["Flight_Date"] == '11/07/2016')]

flight[(flight.Origin_Airport == 'ATL') | (flight.Origin_Airport == 'BOS')]['Origin_Airport']

# If you want to see how many cases are affected, use shape[0]

flight[flight["Distance"] >= 3000].shape[0]
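
The same kind of filter can also be written with the query() method, which some find easier to read for longer conditions. A sketch using the columns from above:

# Equivalent conditional selection with query()
flight.query('Distance >= 3000 & DayOfWeek == 1').shape[0]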

5.3 Set option

With the set_option function you can control how many rows and columns are displayed.

flight.head()

pd.set_option('display.max_rows', 2)
pd.set_option('display.max_columns', 2)
flight.head(200)

# Don't forget to reset the options when they are no longer required.

pd.reset_option('all')
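
If you only need different display settings temporarily, pd.option_context() applies them inside a single with block and restores the previous values automatically:

# Options are only active inside the with block
with pd.option_context('display.max_rows', 2, 'display.max_columns', 2):
    print(flight.head(200))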

6 Dropping Values

6.1 Dropping Columns

df = pd.DataFrame({'Name': ['Anton', 'Moni', np.nan, 'Renate', 'Justus'],
                   'Age': [32, 22, 62, np.nan, 18],
                   'Salary': [np.nan, np.nan, 4500, 2500, 3800],
                   'Job': ['Student', np.nan, 'Manager', 'Teacher', 'Student']})
df

reduced_df = df.drop(['Job'], axis=1)
reduced_df.head()

reduced_df2 = df.drop(['Salary', 'Job'], axis=1)
reduced_df2.head()

# You can also use a list to exclude columns
col_to_exclude = ['Salary', 'Job']

reduced_df_with_list = df.drop(col_to_exclude, axis=1)
reduced_df_with_list.head()

6.2 Dropping NaN Values

df

# Dropping all rows with a NaN value in column 'Name'
df.dropna(subset=['Name'])

# Dropping all rows that have a NaN value in 'Salary' or 'Job'
df.dropna(subset=['Salary', 'Job'])
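
dropna() has further useful parameters: how='all' drops a row only if every column is NaN, and thresh keeps rows with at least a given number of non-NaN values. A short sketch on the dummy data frame from above:

# Drop rows only if ALL values are NaN
df.dropna(how='all')

# Keep only rows with at least 3 non-NaN values
df.dropna(thresh=3)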

6.3 NaN Values vs. Zero Values

NaN values and zero values are not the same thing. The examples below make this clear, so that you do not proceed from a false assumption.

df_NaN_vs_Null = pd.DataFrame({'AIRLINE': ['AS', 'LH', 'KE'],
                               'VALUE': [1, 0, np.nan]})
df_NaN_vs_Null

# The isna() function does its job well
df_NaN_vs_Null[df_NaN_vs_Null['VALUE'].isna()]

# The isnull() function is an alias of isna(): it also looks for NaN values, not for zero values!
df_NaN_vs_Null[df_NaN_vs_Null['VALUE'].isnull()]

# For zero values you have to filter the respective column like this:
df_NaN_vs_Null[(df_NaN_vs_Null["VALUE"] == 0)]

# If you are looking for both (NaN and zero values), use this method:
df_NaN_vs_Null[(df_NaN_vs_Null["VALUE"] == 0) | (df_NaN_vs_Null["VALUE"].isnull())]
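
To quantify the difference, you can count both kinds of values separately. A minimal sketch:

# Count NaN values vs. zero values in the 'VALUE' column
df_NaN_vs_Null['VALUE'].isna().sum()
(df_NaN_vs_Null['VALUE'] == 0).sum()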

7 Filtering Values

Let’s use this dummy dataset:

df = pd.DataFrame({'Name': ['Maria', 'Marc', 'Julia', 'Mike', 'Sarah', 
                            'Sven', 'Mel', 'Alex', 'John', 'Marlene'],
                   'Favorite_Fruit': ['Banana', 'Apple', 'Melon', 'Peach', 'Grape', 
                                      'Melon', 'Orange', 'Banana', 'Melon', 'Apple']})
df

7.1 Filter with Lists

value_list = ["Apple", "Melon"]

boolean_value_list = df['Favorite_Fruit'].isin(value_list)
filtered_df = df[boolean_value_list]
filtered_df

7.2 Exclude certain values

value_list = ["Apple", "Melon"]

inverse_boolean_value_list = ~df.Favorite_Fruit.isin(value_list)
inverse_filtered_df = df[inverse_boolean_value_list]
inverse_filtered_df
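
Note that the tilde (~) negates any boolean mask, not just isin(). For example, a string condition works the same way:

# Exclude all fruits whose name starts with 'M'
df[~df['Favorite_Fruit'].str.startswith('M')]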

8 Working with Lists

8.1 Creation of Lists

df

# Getting a list of unique values from a specific column

unique_list = df['Favorite_Fruit'].unique().tolist()
unique_list

# If you would like to get a list of the columns within a df, use the to_list() function

column_list = df.columns.to_list()
column_list

8.2 Comparing Lists

list_A = ['A', 'B', 'C', 'D']

list_B = ['B', 'C', 'D', 'E']

# Elements in A not in B
filtered_list = list(set(list_A) - set(list_B))
filtered_list

# Elements in B not in A
filtered_list = list(set(list_B) - set(list_A))
filtered_list

# Elements that occur in both lists (common elements)
filtered_list = list(set(list_A).intersection(list_B))
# Also works: filtered_list = list(set(list_B).intersection(list_A))

filtered_list

# Elements that occur in only one of the two lists (not common to both)
filtered_list = list(set(list_A) ^ set(list_B))
filtered_list
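
The same comparisons can be written with the explicit set methods, which some readers find clearer than the operators:

# Explicit set methods instead of operators
list(set(list_A).difference(list_B))            # in A, not in B
list(set(list_A).symmetric_difference(list_B))  # in exactly one of the two lists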

9 Conclusion

Data wrangling is one of the most important disciplines in the field of data science. This was just a small sample of what is possible.