1 Introduction

Goupby is one of the most used functions in data analysis. Therefore, it is worth to take a closer look at their functioning.

For this post the dataset flight from the statistic platform “Kaggle” was used. You can download it from my GitHub Repository.

2 Loading the libraries and the data

import pandas as pd
import numpy as np

flight = pd.read_csv("path/to/file/flight.csv")

3 Group by

flight.groupby('Origin_Airport').size().head()

flight.groupby(['Origin_Airport','DayOfWeek']).size().head(17).T

flight.groupby(['Origin_Airport']).get_group('BOS').head()               
#add. Filter on 'BOS'

3.1 with size

df = pd.DataFrame({"Person":
                   ["John", "Myla", "Lewis", "John", "Myla"],
                   "Age": [24., np.nan, 21., 33, 26],
                   "Single": [False, True, True, True, False]})
df

df[['Single', 'Age']].groupby(['Single']).size()

3.2 with count

df[['Single', 'Age']].groupby(['Single']).count()     
#The count function don't consicer NaN values

3.2.1 Count Non - Zero Observations

Vertical count

df = pd.DataFrame({"Person":
                   ["Männlich", "Weiblich", "Männlich", "Männlich", "Weiblich", "Männlich", "Weiblich",                     "Männlich", "Weiblich", "Männlich", "Weiblich", "Männlich", "Weiblich"],
                    "Verspätung in Min.": [0, 0, 4., 0, 5, 1, 0, 0, 11, 5, 4, 0, 9]})
df.head(6)

df['Verspätet?'] = np.where(df['Verspätung in Min.'] > 0, 1, 0)
df[['Person', 'Verspätet?']].groupby(['Person']).sum()

Horizontal count

df = pd.DataFrame({"Person":
                   ["Person 1", "Person 2", "Person 3"],
                   "MZP1": 
                   [0, 2, 4],
                   "MZP2": 
                   [0, 3, 6],
                   "MZP3": 
                   [1, 7, 0]})
df.head()

df2 = df[['MZP1', 'MZP2', 'MZP3']]
df2['Zwischensumme'] = df.astype(bool).sum(axis=1)
df2['Verspätungen'] = df2.Zwischensumme - 1
df2

astype(bool).sum(axis=1) also counts the index (and categorical variables if any). Therefore select and “Zwischensmme - 1”

df['Anzahl Verspätungen'] = df2.Verspätungen
df = df[['Person', 'Anzahl Verspätungen']].sort_values(by='Anzahl Verspätungen', ascending=False)
df

3.3 with sum

df[['Single', 'Age']].groupby(['Single']).sum().reset_index()

3.4 with nunique

df.groupby(['Single']).nunique()       
#nunique counts characteristics within the respective sorting (without NAs)

3.5 with mean

df[['Single', 'Age']].groupby(['Single']).mean().reset_index()

3.6 with agg.

df[['Single', 'Age']].groupby(['Single']).agg(['mean', 'median', 'std', 'min', 'max']).reset_index()

df_rank = flight.groupby('Origin_Airport')      #another way

df_descriptive = df_rank['Dep_Delay'].agg(['mean', 'median', 'std', 'min', 'max']).reset_index()


# Renaming Pandas Dataframe Columns
df_descriptive = df_descriptive.rename(columns={'Origin_Airport':'Origin Airport', 'mean':'Mean', 'median':'Median', 'std':'Standard Deviation', 'min':'Minimum', 'max': 'Maximum'})

df_descriptive.head()

4 Convert the group_by output to a dataframe

Sometimes we want to further analyze the output of a group_by command. For this purpose it is suitable to transform the output into such a format.

For this I create an example data set again:

df = pd.DataFrame({"Person": ["John", "John", "John", "Myla", "Myla"],
                   "consumption": [40, 10, 40, 5, 15]})
df

With most group_by commands (mean, sum and count) the format dataframe is already stored. Only some columns are still pivoted. This can be adjusted as follows.

grouped_df = df.groupby('Person').mean()
grouped_df

reset_df = grouped_df.reset_index()
reset_df

For example if we use group_by with size we get the following output:

grouped_df_2 = df.groupby('Person').size()
grouped_df_2

But we can also solve this problem with reset_index(). Even better, we can even give the name the column should take.

reset_df_2 = grouped_df_2.reset_index(name='number_of_observations')
reset_df_2

5 Conclusion

This was a small insight, how the groupby function works.

The use of the groupby function