Python's Pipe - Operator

1 Introduction

Anyone who has ever worked with R probably knows the very useful pipe operator %>%. With pandas, Python offers something similar: wrapping an expression in parentheses lets you chain operations line by line. Several variations of this are presented below.

For this post, the flight dataset from the data science platform “Kaggle” was used. You can download it from my GitHub repository.

Loading the libraries and the data

import pandas as pd
flight = pd.read_csv("path/to/file/flight.csv")
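
If the CSV file is not at hand, a small stand-in DataFrame is enough to try the chains in this post. The sketch below is only an assumption about the data: the column names are the ones used in the examples of this post, but every value is invented for illustration and does not come from the Kaggle dataset.

```python
import pandas as pd

# Hypothetical stand-in for flight.csv: column names follow the post,
# all values are made up for illustration only.
flight = pd.DataFrame({
    "DayOfWeek": [1, 4, 5, 6, 4, 2],
    "UniqueCarrier": ["AA", "DL", "UA", "AA", "DL", "UA"],
    "Origin_Airport": ["JFK", "JFK", "LAX", "JFK", "LAX", "LAX"],
    "Scheduled_Departure": [900, 1230, 1545, 700, 2010, 1115],
    "Dep_Delay": [5, -2, 30, 0, 12, 7],
    "Scheduled_Arrival": [1200, 1500, 1830, 1000, 2300, 1400],
    "Arrival_Delay": [3, -5, 41, -1, 15, 2],
})

print(flight.head())
```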

2 Python’s Pipe - Operator like R’s %>%

df = (
    flight
    [['DayOfWeek', 'UniqueCarrier', 'Origin_Airport']]
)

df.head()

2.1 Filter and select

df = (
    flight
    [flight.DayOfWeek > 3]
    [['DayOfWeek', 'UniqueCarrier', 'Origin_Airport']]
)

df.head()

2.2 Multiple filter and select

df = (
    flight
    [(flight.DayOfWeek > 3) & (flight.Origin_Airport == 'JFK')]
    [['DayOfWeek', 'UniqueCarrier', 'Origin_Airport']]
)

df.head()
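
One caveat with the filters above: the mask refers to flight itself, so it only lines up as long as no earlier step in the chain has changed the rows. A possible alternative, sketched here on invented stand-in data, is to filter with .loc and a lambda (which receives the intermediate DataFrame) or with .query. The data values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in data; values are invented for illustration.
flight = pd.DataFrame({
    "DayOfWeek": [1, 4, 5, 6, 4, 2],
    "UniqueCarrier": ["AA", "DL", "UA", "AA", "DL", "UA"],
    "Origin_Airport": ["JFK", "JFK", "LAX", "JFK", "LAX", "LAX"],
})

# Variant 1: the lambda receives the DataFrame produced by the previous step,
# so the filter still works after sorting, sampling or other changes.
df = (
    flight
    .sort_values("DayOfWeek")
    .loc[lambda d: (d.DayOfWeek > 3) & (d.Origin_Airport == "JFK")]
    [["DayOfWeek", "UniqueCarrier", "Origin_Airport"]]
)

# Variant 2: the same filter written as a query string.
df_query = (
    flight
    .query("DayOfWeek > 3 and Origin_Airport == 'JFK'")
    [["DayOfWeek", "UniqueCarrier", "Origin_Airport"]]
)

print(df)
print(df_query)
```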

2.3 Sample and sort

(
    flight
    .sample(n=3)
    .sort_values('DayOfWeek')
    [['DayOfWeek', 'UniqueCarrier', 'Origin_Airport']]
)
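
Because sample draws rows at random, the output of the chain above changes with every run. If a reproducible result is wanted, the random_state parameter of sample can be set. The following is a minimal sketch on invented stand-in data; the seed value 42 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-in data; values are invented for illustration.
flight = pd.DataFrame({
    "DayOfWeek": [1, 2, 3, 4, 5, 6],
    "UniqueCarrier": ["AA", "DL", "UA", "AA", "DL", "UA"],
    "Origin_Airport": ["JFK", "JFK", "LAX", "JFK", "LAX", "LAX"],
})

sampled = (
    flight
    .sample(n=3, random_state=42)   # fixed seed: same three rows on every run
    .sort_values("DayOfWeek")
    [["DayOfWeek", "UniqueCarrier", "Origin_Airport"]]
)

print(sampled)
```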

2.4 Multiple group by and summarize

df = (
    flight
    .groupby(['Origin_Airport', 'DayOfWeek'])
    # agg is the counterpart of summarize in R
    .agg({'Scheduled_Departure': 'mean'})
    # agg keeps the original column name, so we rename it to reflect the mean
    .rename(columns={"Scheduled_Departure": "avg_Scheduled_Departure"})
    # grouping by two columns yields a MultiIndex; reset_index() turns it back into columns
    .reset_index()
    .round({'avg_Scheduled_Departure': 2})
)

df.head(12)
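
The separate rename step can be avoided with named aggregation, where the new column name is given directly inside agg. This is a sketch of that alternative on invented stand-in data, not the approach used in the rest of this post.

```python
import pandas as pd

# Hypothetical stand-in data; values are invented for illustration.
flight = pd.DataFrame({
    "DayOfWeek": [4, 4, 5, 5],
    "Origin_Airport": ["JFK", "JFK", "LAX", "LAX"],
    "Scheduled_Departure": [1230, 1300, 1545, 1600],
})

df = (
    flight
    .groupby(["Origin_Airport", "DayOfWeek"])
    # named aggregation: new_column=(source_column, function)
    .agg(avg_Scheduled_Departure=("Scheduled_Departure", "mean"))
    .reset_index()
    .round({"avg_Scheduled_Departure": 2})
)

print(df)
```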

2.5 Group by and multiple summarize

df = (
    flight
    .groupby(['Origin_Airport'])
    .agg({'Scheduled_Departure': 'mean',
          'Dep_Delay': 'mean',
          'Scheduled_Arrival': 'mean',
          'Arrival_Delay': 'mean'})
    .rename(columns={"Scheduled_Departure": "avg_Scheduled_Departure",
                     "Dep_Delay": "avg_Dep_Delay",
                     "Scheduled_Arrival": "avg_Scheduled_Arrival",
                     "Arrival_Delay": "avg_Arrival_Delay"})
    .reset_index()
    .round({'avg_Scheduled_Departure': 2,
            'avg_Dep_Delay': 2,
            'avg_Scheduled_Arrival': 2,
            'avg_Arrival_Delay': 2})
)

df.head(12)
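
All steps so far were built-in DataFrame methods. For custom steps, pandas also provides the DataFrame.pipe method, which passes the intermediate DataFrame to any function and is probably the closest match to R's %>%. The sketch below runs on invented stand-in data; the helper functions add_total_delay and keep_delayed, the Total_Delay column and the threshold of 10 are hypothetical names and values chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in data; values are invented for illustration.
flight = pd.DataFrame({
    "Origin_Airport": ["JFK", "JFK", "LAX", "JFK"],
    "Dep_Delay": [5, -2, 30, 0],
    "Arrival_Delay": [3, -5, 41, -1],
})


def add_total_delay(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with a combined delay column."""
    return data.assign(Total_Delay=data["Dep_Delay"] + data["Arrival_Delay"])


def keep_delayed(data: pd.DataFrame, threshold: int) -> pd.DataFrame:
    """Hypothetical helper: keep rows whose total delay exceeds the threshold."""
    return data[data["Total_Delay"] > threshold]


df = (
    flight
    .pipe(add_total_delay)              # same as add_total_delay(flight)
    .pipe(keep_delayed, threshold=10)   # extra arguments are passed through
    [["Origin_Airport", "Total_Delay"]]
)

print(df)
```

Since assign returns a new DataFrame, flight itself stays unchanged and the chain can be rerun safely.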

3 Conclusion

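In my opinion, this pipe-style chaining is a very useful feature: filtering, selecting, sampling, sorting, grouping and summarizing can all be written as one readable sequence of steps, as the examples above show.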