4 min read

Reshape a pandas DataFrame

1 Introduction

After merging data (data management), we now come to the topic of how to reshape DataFrames.

2 Import the libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

3 Import the data

df = pd.DataFrame({'Name': ['Alice', 'Mike', 'Sven', 'Julia'],
                   'Age': [25,35,22,24],
                   'Height': [160,175,180,165],
                   'Weight': [55,80,85,57],
                   'State': ['Texas', 'Boston', 'Marquette', 'Alaska']})

4 Reshape a pandas DataFrame

4.1 stack()

Stack method works with the MultiIndex objects in DataFrame, it returning a DataFrame with an index with a new inner-most level of row labels. It changes the wide table to a long table.

In short: The stack function of pandas is used for stacking the levels from columns to index.

print('Original DataFrame:')
df

df_stacked = df.stack()
df_stacked = pd.DataFrame(df_stacked)
df_stacked

Of course, you can also undo this redesign.

df_unstacked = df_stacked.unstack()
df_unstacked

4.1.1 Application example

How can you use this function meaningfully in real life?

Information from rows

For some analysis methods it is necessary to have the values per line in one column available separately. Have a look at this dataframe:

df_appl_1 = pd.DataFrame({'title': ['A', 'B', 'C'],
                          'Runtime': [81,50,120],
                          'genres': ['animation, comedy, family', 'romance, comedy', 'comedy']})
df_appl_1

The different genres that apply to each film are unfortunately summarized in one column. Now we want the genres to be listed per line. For example, if a film is attributed three genres, I want the film to be listed in three separate lines, each with one of the three genres attributed. I can achieve this in 4 steps:

# Step 1
# We start with creating a new dataframe from the series with title as the index
new_df = pd.DataFrame(df_appl_1.genres.str.split(',').tolist(), index=df_appl_1.title).stack()
new_df

# Step 2
# We now want to get rid of the secondary index
# To do this, we will make title as a column (it can't be an index since the values will be duplicate)
new_df = new_df.reset_index([0, 'title'])
new_df

# Step 3
# Now we have to set the column names as we want them
new_df.columns = ['title', 'genre']
new_df

# Step 4
# The final step is to add the remaining information from the original record to the newly created record
final = pd.merge(new_df, df_appl_1[['title', 'Runtime']], on='title', how='left')
final

Here we go !

Information from columns

Just as often it can happen that the structure of the data (e.g. when joining datasets) does not fit. Have a look at this case:

df_appl_2 = pd.DataFrame({'Year': [2019, 2019, 2020, 2020],
                          'Season': ['Summer', 'Winter','Summer','Winter'],
                          'Rainfall_mm': [185, 2500, 320, 1800],
                          'Sunshine_h': [800, 500, 900, 600]})

df_appl_2 = df_appl_2.groupby(['Year', 'Season']).sum()
df_appl_2

In such a case the stack function can also be used:

final2 = df_appl_2.stack().reset_index()
final2.columns = ['Year', 'Season',
                  'Weather_Information', 
                  'Value']
final2

4.2 melt()

Pandas melt() function is used for unpivoting a DataFrame from wide to long format.

Melt in pandas reshape dataframe from wide format to long format. It uses the id_vars[‘Column_Name’] for melt the dataframe by column names.

I will always show the original dataset so that you can better understand the melt commands.

print('Original DataFrame:')
df

df_melt = df.melt()
df_melt

print('Original DataFrame:')
df

df_melt2 = df.melt(id_vars=['Name'])
df_melt2

print('Original DataFrame:')
df

df_melt3 = df.melt(id_vars=['Name', 'State'])
df_melt3

Use the id_vars[‘col_names’] for melt the dataframe by column names and id_value_vars[‘col_names’] as the value variable.

print('Original DataFrame:')
df

df_melt4 = df.melt(id_vars=['Name'], value_vars=['Age'])
df_melt4

With var_name and value_name you also have the possibility to name the two generated columns accordingly:

df_melt4 = df.melt(id_vars=['Name'], value_vars=['Age'],
                  var_name ='Variable_column', value_name ='Value_column')
df_melt4

4.2.1 Application example

Here again an example how melt can be used in practice. We have the following data set:

df_appl_3 = pd.DataFrame({'Age': [24,22,26],
                          'Work_Experience': [7, 5, 8]})
df_appl_3

Our goal is to create a boxplot with seaborn.

data_melt = pd.melt(df_appl_3)
data_melt

sns.boxplot(x="variable", y="value", data=data_melt)
plt.title("Boxplot",fontweight='bold')
plt.xlabel("Columns",fontweight='bold')
plt.ylabel("Years",fontweight='bold')

plt.show()

The two commands shown before do not have to be executed separately. It is also possible to write them together:

sns.boxplot(x="variable", y="value", data=pd.melt(df_appl_3))
plt.title("Boxplot2",fontweight='bold')
plt.xlabel("Columns",fontweight='bold')
plt.ylabel("Years",fontweight='bold')

plt.show()

5 Comparison of stack() and melt()

To better understand the different functionality of the two functions stack() and melt() I have compared the two outputs again:

print('Original DataFrame:')
df

Comparison of stack() and melt():

6 Conclusion

In this post I have shown how to use the stack() and melt() functions to reshape panda’s DataFrame accordingly.