
Machine Learning Pipelines

1 Introduction

Some time ago I wrote the post The Data Science Process (CRISP-DM), which covered the proper development of machine learning algorithms. As you saw there, this is quite a time-consuming matter if done correctly.

To quickly check which algorithm fits the data best, it is recommended to use machine learning pipelines. Once you have found a promising algorithm, you can start fine-tuning it and go through the process described in that post.

For this post, the dataset bird from the statistics platform “Kaggle” was used. You can download it from my “GitHub Repository”.

2 Loading the libraries and classes

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split


from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

from sklearn.decomposition import PCA


from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score


class Color:
    PURPLE = '\033[95m'
    CYAN = '\033[96m'
    DARKCYAN = '\033[36m'
    BLUE = '\033[94m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    END = '\033[0m'
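
The Color class simply bundles ANSI escape codes so that console output can be highlighted later on. A quick usage example (assuming an ANSI-capable terminal):

# Print the word 'Important' in bold red, then reset the formatting
print(Color.RED + Color.BOLD + 'Important' + Color.END)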

3 Loading the data

bird_df = pd.read_csv('bird.csv').dropna()  # drop rows with missing values
bird_df

Description of predictors:

  • Length and Diameter of Humerus
  • Length and Diameter of Ulna
  • Length and Diameter of Femur
  • Length and Diameter of Tibiotarsus
  • Length and Diameter of Tarsometatarsus

bird_df['type'].value_counts()

Description of the target variable:

  • SW: Swimming Birds
  • W: Wading Birds
  • T: Terrestrial Birds
  • R: Raptors
  • P: Scansorial Birds
  • SO: Singing Birds

bird_df['type'].nunique()

x = bird_df.drop(['type', 'id'], axis=1)
y = bird_df['type']

trainX, testX, trainY, testY = train_test_split(x, y, test_size=0.2)
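
Note that the split is random, so the numbers below will vary from run to run. A reproducible, stratified variant could look like this (a sketch; the stratify and random_state arguments are my additions, not part of the original example):

# Keep the class proportions in both sets and fix the random seed
trainX, testX, trainY, testY = train_test_split(x, y, test_size=0.2,
                                                stratify=y, random_state=42)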

4 ML Pipelines

4.1 A simple Pipeline

Let’s start with a simple pipeline.

In the following, I would like to perform a classification of bird species using Logistic Regression. For this purpose, the data should first be scaled using scikit-learn's StandardScaler.

Creation of the pipeline:

pipe_lr = Pipeline([
    ('ss', StandardScaler()),
    ('lr', LogisticRegression())
    ])
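
As a side note, scikit-learn's make_pipeline builds the same object and derives the step names automatically from the class names (an equivalent sketch):

from sklearn.pipeline import make_pipeline

# Steps are auto-named 'standardscaler' and 'logisticregression'
pipe_lr_alt = make_pipeline(StandardScaler(), LogisticRegression())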

Fit and Evaluate the Pipeline:

pipe_lr.fit(trainX, trainY)
y_pred = pipe_lr.predict(testX)


print('Test Accuracy: {:.2f}'.format(accuracy_score(testY, y_pred)))

OK, an accuracy of 0.75, not bad. Let’s see if we can improve the result by choosing a different scaler.
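
Before trying other scalers, a brief aside: every fitted step of a pipeline remains accessible via named_steps, which is handy for inspecting what was learned (a sketch using the step names defined above):

# Access the fitted steps by the names given in the Pipeline definition
scaler = pipe_lr.named_steps['ss']
clf = pipe_lr.named_steps['lr']

print(scaler.mean_[:3])  # feature means learned by the StandardScaler
print(clf.classes_)      # class labels seen by the LogisticRegression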

4.2 Determination of the best Scaler

4.2.1 Creation of the Pipeline

pipe_lr_wo = Pipeline([
    ('lr', LogisticRegression())
    ])

pipe_lr_ss = Pipeline([
    ('ss', StandardScaler()),
    ('lr', LogisticRegression())
    ])

pipe_lr_mms = Pipeline([
    ('mms', MinMaxScaler()),
    ('lr', LogisticRegression())
    ])

pipe_lr_rs = Pipeline([
    ('rs', RobustScaler()),
    ('lr', LogisticRegression())
    ])

4.2.2 Creation of a Pipeline Dictionary

To present the results in a more readable way later on, I always create a matching dictionary at this point.

pipe_dic = {
    0: 'LogReg wo scaler',
    1: 'LogReg with StandardScaler',
    2: 'LogReg with MinMaxScaler',
    3: 'LogReg with RobustScaler',
    }

pipe_dic

4.2.3 Fit the Pipeline

To fit the pipelines, I first group them into a list:

pipelines = [pipe_lr_wo, pipe_lr_ss, pipe_lr_mms, pipe_lr_rs]

Now we are going to fit the created pipelines:

for pipe in pipelines:
    pipe.fit(trainX, trainY)

4.2.4 Evaluate the Pipeline

for idx, val in enumerate(pipelines):
    print('%s pipeline Test Accuracy: %.2f' % (pipe_dic[idx], accuracy_score(testY, val.predict(testX))))

We can also use the .score() method, which returns the mean accuracy for classifiers:

for idx, val in enumerate(pipelines):
    print('%s pipeline Test Accuracy: %.2f' % (pipe_dic[idx], val.score(testX, testY)))

I always like to have the results displayed in a DataFrame so that I can sort and filter them:

result = []

for idx, val in enumerate(pipelines):
    result.append(accuracy_score(testY, val.predict(testX)))


result_df = pd.DataFrame(list(pipe_dic.items()), columns=['Idx', 'Estimator'])

# Add Test Accuracy to result_df
result_df['Test_Accuracy'] = result
# print result_df
result_df

Let’s take a look at our best model:

best_model = result_df.sort_values(by='Test_Accuracy', ascending=False)
best_model

print(best_model['Estimator'].iloc[0] +
      ' Classifier has the best Test Accuracy of ' +
      str(round(best_model['Test_Accuracy'].iloc[0] * 100, 2)) + '%')

Or, with the print statement spiffed up a little:

print(Color.RED + best_model['Estimator'].iloc[0] + Color.END +
      ' Classifier has the best Test Accuracy of ' +
      Color.GREEN + Color.BOLD + str(round(best_model['Test_Accuracy'].iloc[0] * 100, 2)) + '%' + Color.END)

4.3 Determination of the best Estimator

This time, let’s try different estimators to improve the result.

4.3.1 Creation of the Pipeline

pipe_lr = Pipeline([
    ('ss1', StandardScaler()),
    ('lr', LogisticRegression())
    ])

pipe_svm_lin = Pipeline([
    ('ss2', StandardScaler()),
    ('svm_lin', SVC(kernel='linear'))
    ])

pipe_svm_sig = Pipeline([
    ('ss3', StandardScaler()),
    ('svm_sig', SVC(kernel='sigmoid'))
    ])


pipe_knn = Pipeline([
    ('ss4', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=7))
    ])

pipe_dt = Pipeline([
    ('ss5', StandardScaler()),
    ('dt', DecisionTreeClassifier())
    ])

pipe_rf = Pipeline([
    ('ss6', StandardScaler()),
    ('rf', RandomForestClassifier(n_estimators=100))
    ])

4.3.2 Creation of a Pipeline Dictionary

pipe_dic = {
    0: 'lr',
    1: 'svm_lin',
    2: 'svm_sig',
    3: 'knn',
    4: 'dt',
    5: 'rf'
    }

pipe_dic

4.3.3 Fit the Pipeline

pipelines = [pipe_lr, pipe_svm_lin, pipe_svm_sig, pipe_knn, pipe_dt, pipe_rf]
for pipe in pipelines:
    pipe.fit(trainX, trainY)
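
As an aside, the six pipelines and the matching dictionary can also be generated in a single loop, which avoids the copy-and-paste (a sketch producing the same structure as above):

estimators = [
    ('lr', LogisticRegression()),
    ('svm_lin', SVC(kernel='linear')),
    ('svm_sig', SVC(kernel='sigmoid')),
    ('knn', KNeighborsClassifier(n_neighbors=7)),
    ('dt', DecisionTreeClassifier()),
    ('rf', RandomForestClassifier(n_estimators=100))
]

# One StandardScaler + estimator pipeline per entry, plus the name lookup
pipelines = [Pipeline([('ss', StandardScaler()), (name, est)]) for name, est in estimators]
pipe_dic = {idx: name for idx, (name, _) in enumerate(estimators)}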

4.3.4 Evaluate the Pipeline

for idx, val in enumerate(pipelines):
    print('%s pipeline Test Accuracy: %.2f' % (pipe_dic[idx], accuracy_score(testY, val.predict(testX))))

result = []

for idx, val in enumerate(pipelines):
    result.append(accuracy_score(testY, val.predict(testX)))


result_df = pd.DataFrame(list(pipe_dic.items()), columns=['Idx', 'Estimator'])

# Add Test Accuracy to result_df
result_df['Test_Accuracy'] = result
# print result_df
result_df

best_model = result_df.sort_values(by='Test_Accuracy', ascending=False)
best_model

print(Color.RED + best_model['Estimator'].iloc[0] + Color.END +
      ' Classifier has the best Test Accuracy of ' +
      Color.GREEN + Color.BOLD + str(round(best_model['Test_Accuracy'].iloc[0] * 100, 2)) + '%' + Color.END)

Done. The linear support vector classifier has improved the accuracy.
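
Keep in mind, though, that this ranking rests on a single 80/20 split. For a more stable comparison, each pipeline could be evaluated with cross-validation instead (a sketch using 5-fold CV on the training data):

from sklearn.model_selection import cross_val_score

for idx, pipe in enumerate(pipelines):
    scores = cross_val_score(pipe, trainX, trainY, cv=5)  # 5-fold CV accuracy
    print('%s pipeline CV Accuracy: %.2f (+/- %.2f)' % (pipe_dic[idx], scores.mean(), scores.std()))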

4.4 ML Pipelines with further Components

At this point you can experiment freely: combine the estimators with different scalers, or even try including a PCA.

I use the same pipelines as in the previous example and add a PCA with n_components=2 in front of each estimator.

pipe_lr = Pipeline([
    ('ss1', StandardScaler()),
    ('pca1', PCA(n_components=2)),
    ('lr', LogisticRegression())
    ])

pipe_svm_lin = Pipeline([
    ('ss2', StandardScaler()),
    ('pca2', PCA(n_components=2)),
    ('svm_lin', SVC(kernel='linear'))
    ])

pipe_svm_sig = Pipeline([
    ('ss3', StandardScaler()),
    ('pca3', PCA(n_components=2)),
    ('svm_sig', SVC(kernel='sigmoid'))
    ])


pipe_knn = Pipeline([
    ('ss4', StandardScaler()),
    ('pca4', PCA(n_components=2)),
    ('knn', KNeighborsClassifier(n_neighbors=7))
    ])

pipe_dt = Pipeline([
    ('ss5', StandardScaler()),
    ('pca5', PCA(n_components=2)),
    ('dt', DecisionTreeClassifier())
    ])

pipe_rf = Pipeline([
    ('ss6', StandardScaler()),
    ('pca6', PCA(n_components=2)),
    ('rf', RandomForestClassifier(n_estimators=100))
    ])

pipe_dic = {
    0: 'lr',
    1: 'svm_lin',
    2: 'svm_sig',
    3: 'knn',
    4: 'dt',
    5: 'rf'
    }

pipelines = [pipe_lr, pipe_svm_lin, pipe_svm_sig, pipe_knn, pipe_dt, pipe_rf]

for pipe in pipelines:
    pipe.fit(trainX, trainY)

for idx, val in enumerate(pipelines):
    print('%s pipeline Test Accuracy: %.2f' % (pipe_dic[idx], accuracy_score(testY, val.predict(testX))))

result = []

for idx, val in enumerate(pipelines):
    result.append(accuracy_score(testY, val.predict(testX)))


result_df = pd.DataFrame(list(pipe_dic.items()), columns=['Idx', 'Estimator'])

# Add Test Accuracy to result_df
result_df['Test_Accuracy'] = result
# print result_df
result_df

best_model = result_df.sort_values(by='Test_Accuracy', ascending=False)
best_model

print(Color.RED + best_model['Estimator'].iloc[0] + Color.END +
      ' Classifier has the best Test Accuracy of ' +
      Color.GREEN + Color.BOLD + str(round(best_model['Test_Accuracy'].iloc[0] * 100, 2)) + '%' + Color.END)

Using a PCA did not bring any improvement. We can therefore fall back on the linear Support Vector Classifier at this point and try to improve the result further with fine-tuning, as sketched below.
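
To see why two components are too few here, one can check how much of the total variance they actually retain (a sketch reading the fitted PCA out of the first pipeline):

# The ten predictors were compressed to two components; check the retained variance
pca = pipe_lr.named_steps['pca1']
print(pca.explained_variance_ratio_)        # variance share per component
print(pca.explained_variance_ratio_.sum())  # total variance retained

And as an outlook on that fine-tuning step: pipelines combine nicely with GridSearchCV, where hyperparameters are addressed as <step name>__<parameter>. A sketch for the linear SVC pipeline from section 4.3, without PCA (the parameter grid is an assumption, not a tuned recommendation):

from sklearn.model_selection import GridSearchCV

# Re-create the pipeline from section 4.3 (StandardScaler + linear SVC, no PCA)
pipe_svm_lin = Pipeline([
    ('ss', StandardScaler()),
    ('svm_lin', SVC(kernel='linear'))
    ])

# Step name 'svm_lin' + '__' + parameter name addresses the SVC's C parameter
param_grid = {'svm_lin__C': [0.1, 1, 10, 100]}

grid = GridSearchCV(pipe_svm_lin, param_grid, cv=5)
grid.fit(trainX, trainY)

print(grid.best_params_)
print('Test Accuracy: {:.2f}'.format(grid.score(testX, testY)))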

5 Conclusion

In this post, I showed how machine learning pipelines can be used to quickly and efficiently run through different scenarios and get a first impression of which algorithm fits the data best.