6 min read

Data Manipulation

1 Introduction

Data manipulation is an elementary component in the data science field that requires the most time, among other things. It is therefore worthwhile to be fit in this discipline.

For this post the dataset flight from the statistic platform “Kaggle” was used. You can download it from my GitHub Repository.

import pandas as pd
import numpy as np
flight = pd.read_csv("path/to/file/flight.csv")

2 Index

If you’ve worked with R before, you may not be used to working with an index. This is common in Python.

2.1 Resetting index

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df

df = df.reset_index()
df

df.rename(columns ={df.columns[0]: 'ABC'}, inplace = True)
df

df.index.tolist()

df['A'].tolist()

2.2 Resetting multiindex

# Create a multiindex 


index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])


data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

health_data.columns = health_data.columns.droplevel()
health_data = health_data.reset_index()
health_data

2.3 Setting index

Here we have the previously created data frame.

df

Now we would like to set an index again.

df.set_index([pd.Index([1, 2, 3])])

df.set_index('ABC')

df.set_index(['A', 'B'])

3 Modifying Columns

3.1 Rename Columns

show_rename = pd.DataFrame({'Name': ['1.Anton', '2.Susi', '3.Moni', '4.Renate'],
                   'Alter': [32,22,62,44],
                   'Gehalt': [4700, 2400,4500,2500]})
show_rename

show_rename.rename(columns ={show_rename.columns[2]: 'Monatsgehalt'}, inplace = True)
show_rename

show_rename = show_rename.rename(columns={'Name':'Mitarbeiter', 'Alter':'Betriebszugehörigkeit'})
show_rename

show_rename.columns = ['MA', 'Bzghk.', 'MGe.']
show_rename

3.1.1 add_prefix

show_prefix = show_rename.copy()
show_prefix

show_prefix.add_prefix('alte_MA_')

3.3 Add columns

data = {'country': ['Italy','Spain','Greece','France','Portugal'],
        'popu': [61, 46, 11, 65, 10],
        'percent': [0.83,0.63,0.15,0.88,0.14]}

df_MC = pd.DataFrame(data, index=['ITA', 'ESP', 'GRC', 'FRA', 'PRT'])
df_MC

Add a list as a new column

df_MC['capital city'] = ['Rome','Madrid','Athens','Paris','Lisbon']
df_MC

Add an array as a new column

ar = np.array([39,34,30,33,351])
ar
df_MC['Calling code'] = ar
df_MC

Add a Series array as a new column. When adding a Series data are automatically aligned based on index.

ser = pd.Series(['es','it','fr','pt','gr'], index = ['ESP','ITA','FRA','PRT','GRC'])
df_MC['Internet domain'] = ser
df_MC

3.4 Drop and Delete Columns

The drop-function don’t delete columns.

df_MC.drop(columns=['country', 'popu', 'percent'])

df_MC.head(2)

But del-function does this

del df_MC['Internet domain']
df_MC

For multiple deletion use drop-function + inplace = True

df_MC.drop(["popu", "percent", "Calling code"], axis = 1, inplace = True) 
df_MC.head(3)

3.5 Insert Columns

Now I want to have my previously deleted column “Internet domain” back at a certain point.

ser = pd.Series(['es','it','fr','pt','gr'], index = ['ESP','ITA','FRA','PRT','GRC']) 
#previously created syntax
df_MC.insert(1,'Internet domains',ser)
df_MC

3.6 Rearrange Columns

clist = ['capital city','Internet domains','country']
df_MC = df_MC[clist]
df_MC

You can also simply move the last column to the front. This is often the case, for example, when you make predictions, which you would like to put in the original dataframe and usually in the first place.

cols = list(df_MC.columns)
cols = [cols[-1]] + cols[:-1]
df_MC = df_MC[cols]
df_MC

4 Modifying Rows

# Preparations

df =(               
    flight
    .groupby(['Origin_Airport'])
    .agg({'Scheduled_Departure': 'mean', 
          'Dep_Delay': 'mean',
          'Scheduled_Arrival': 'mean',
          'Arrival_Delay': 'mean'})                                   
    .rename(columns={"Scheduled_Departure": "avg_Scheduled_Departure",
                    "Dep_Delay": "avg_Dep_Delay",
                    "Scheduled_Arrival": "avg_Scheduled_Arrival",
                    "Arrival_Delay": "avg_Arrival_Delay"})    
    .reset_index()                                                          
)

df.head(5)

4.1 Round each column

df.round(2).head(2)

4.2 Round columns differently within a df

df.round({'avg_Scheduled_Departure': 1, 
          'avg_Dep_Delay': 2,
         'avg_Scheduled_Arrival':3,
         'avg_Arrival_Delay':4}).head(2)

decimals = pd.Series([1, 2, 3, 4], index=['avg_Scheduled_Departure', 'avg_Dep_Delay', 'avg_Scheduled_Arrival', 'avg_Arrival_Delay'])
df.round(decimals).head(2)

4.3 Drop Duplicates

To get clean data it is often necessary to remove duplicates. We can do so with the drop_duplicates function. Have a look at this dataframe:

df_duplicates = pd.DataFrame({'Case': [1,2,3,4,5],
                   'Value': [5,5,5,7,8]})
df_duplicates

As we can see, there are several identical values in the ‘Value’ column. We do not want to have them like this. With keep=‘first’ we consider only the first value as unique and rest of the same values as duplicate.

df_subset_1 = df_duplicates.drop_duplicates(subset=['Value'], keep='first')
df_subset_1

With keep=‘last’ we consider only the last value as unique and rest of the same values as duplicate.

df_subset_2 = df_duplicates.drop_duplicates(subset=['Value'], keep='last')
df_subset_2

With keep=False we consider all of the same values as duplicates.

df_subset_3 = df_duplicates.drop_duplicates(subset=['Value'], keep=False)
df_subset_3

With the drop_duplicates function there is one more parameter that can be set: inplace. By default this is set to False. If we set this to True, the record does not have to be assigned to a separate object (as we have always done before).

df_duplicates.drop_duplicates(subset=['Value'], keep=False, inplace=True)

df_duplicates

5 Replacing Values

5.1 One by One

df = flight[['DayOfWeek']] 
df = df.replace(5, 'Friday')
df = df.replace(6, 'Saturday')
#and so on ...
df.head(5)

5.2 Collective replacement

df = flight[['DayOfWeek']]
vals_to_replace = {1:'Monday', 2:'Tuesday', 3:'Wednesday', 4:'Thursday', 5:'Friday', 6:'Saturday', 7:'Sunday'}

df['DayOfWeek'] = df['DayOfWeek'].map(vals_to_replace)
df.head()

5.3 Conditional replacement

For an example of conditional replacement have a look at this dataframe:

df = pd.DataFrame({'Name': ['Anton', 'Anton', 'Moni', 'Moni', 'Justus'],
                   'Alter': [32,22,np.NaN,62,18],
                   'Gehalt': [np.NaN, np.NaN,np.NaN,np.NaN,500]})
df

We want to check the names where the column ‘Gehalt’ is NaN.

df[df["Gehalt"].isnull() & (df["Name"] == 'Anton')]

Here we go ! Now we want to replace exspecially these NaNs with a salary of 2.000 for Anton.

df['Gehalt'] = np.where((df.Name == 'Anton'), 2000, df.Gehalt)
df

We can also use multiple conditions for filtering and replacement.

The code below shows a selection of the name (here Moni) for which no age is available. Now we want to replaces the NaNs for the salary.

df['Gehalt'] = np.where((df.Name == 'Moni') & (df.Alter.isna()), 3000, df.Gehalt)
df

Finally we replace the hits that we find under Moni with an age greater than 50.

df['Gehalt'] = np.where((df.Name == 'Moni') & (df.Alter > 50), 4000, df.Gehalt)
df

6 Function for colouring specific values

I always find it quite nice to be able to color-modify Python’s output so that you can immediately see important figures.

6.1 highest values

Sometimes it is useful, e.g. when you want to compare the performance values of algorithms during model training, the highest values are displayed in color.

colouring_df = pd.DataFrame({'Col1': [32,22,17,44],
                             'Col2': [32,22,62,28],
                             'Col3': [4700, 2400,4500,2500]})
colouring_df

colouring_df.style.highlight_max(color = 'lightgreen', axis = 0)

6.2 lowest values

Likewise, you can have even the lowest values displayed in color:

colouring_df.style.highlight_min(color = 'lightgreen', axis = 0)

6.3 highest-lowest values

If you want to highlight values from different columns with different conditions, you can do this as follows:

colouring_df.style.highlight_max(axis=0, 
                                color = 'lightgreen', 
                                subset=['Col1', 
                                        'Col2']).highlight_min(axis=0, 
                                                               color = 'lightgreen', 
                                                               subset=['Col3'])

Here I have highlighted the highest values from columns ‘Col1’ and ‘Col2’ and the lowest value from column ‘Col3’.

6.4 negative values

Here is an example of how to highlight negative values in a data set:

negative_values_df = pd.DataFrame({'Col1': [-32,22,-62,44],
                                   'Col2': [32,-22,62,-44],
                                   'Col3': [-4700, 2400,-4500,2500]})
negative_values_df

# Function for colouring(negative values red and positive values black)

def highlight_neg_values(s): 
    if s.dtype == np.object: 
        is_neg = [False for _ in range(s.shape[0])] 
    else: 
        is_neg = s < 0
    return ['color: red;' if cell else 'color:black' 
            for cell in is_neg] 
   
negative_values_df.style.apply(highlight_neg_values)

7 Conclusion

This was a small insight into the field of data manipulation. In subsequent posts, the topics of string manipulation and the handling of missing values will be shown.