
Randomized Search

1 Introduction

In my last post on “Grid Search” I showed how to tune hyperparameters. As you saw in its last chapter (6.3 Grid Search with more than one estimator), these calculations quickly become computationally intensive, which sometimes leads to very long run times. Randomized Search is a cheaper alternative to Grid Search: instead of trying every combination of parameter values, it evaluates only a fixed number of randomly drawn combinations. In this post I will show how Randomized Search works.
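To make the difference concrete, here is a minimal sketch in plain Python (no scikit-learn) of how the two strategies pick their candidates. It uses the parameter grid from the Grid Search section of this post; the budget `n_iter = 10` and the fixed seed are assumptions chosen for illustration.

```python
import itertools
import random

# Grid taken from the Grid Search section of this post.
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['linear']}

# Grid search: every combination of the listed values is a candidate.
names = list(param_grid)
all_candidates = [dict(zip(names, values))
                  for values in itertools.product(*param_grid.values())]

# Randomized search: a fixed budget of combinations, drawn without replacement.
n_iter = 10             # assumed budget, chosen for illustration
rng = random.Random(0)  # fixed seed so the draw is repeatable
sampled_candidates = rng.sample(all_candidates, n_iter)

print(len(all_candidates), 'candidates in the full grid')
print(len(sampled_candidates), 'candidates in the random draw')
```

Every sampled candidate is also part of the full grid. The saving comes purely from evaluating fewer of them, at the risk of missing the best combination.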

For this post I used the Breast Cancer Wisconsin (Diagnostic) dataset from the platform “Kaggle”. You can download it from my GitHub Repository.

3 Loading the libraries and data

import time

import numpy as np
import pandas as pd

from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

cancer = pd.read_csv("breast_cancer.csv")

4 Data pre-processing

For this post I use the same dataset and the same preparation as for “Grid Search”, so I will not go into much detail about the first steps. If you want to learn more about the individual pre-processing steps, please read my “Grid Search” post.

# Encode the target as integers: benign (B) = 0, malignant (M) = 1
vals_to_replace = {'B': 0, 'M': 1}
cancer['diagnosis'] = cancer['diagnosis'].map(vals_to_replace)
x = cancer.drop(['id', 'diagnosis', 'Unnamed: 32'], axis=1)
y = cancer['diagnosis']

trainX, testX, trainY, testY = train_test_split(x, y, test_size = 0.2)

So let’s do our first prediction with a “Support Vector Machine”:

clf = SVC(kernel='linear')
clf.fit(trainX, trainY)

y_pred = clf.predict(testX)

print('Accuracy: {:.2f}'.format(accuracy_score(testY, y_pred)))

5 Grid Search

Now we are going to tune the hyperparameters with Grid Search. We also measure how long the tuning takes.

param_grid = {'C': [0.1, 1, 10, 100, 1000],  
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
              'kernel': ['linear']} 
start = time.time()

grid = GridSearchCV(SVC(), param_grid, cv = 5, scoring='accuracy')
grid.fit(trainX, trainY)

end = time.time()
print()
print('Calculation time: ' + str(round(end - start,2)) + ' seconds')

print(grid.best_params_) 

grid_predictions = grid.predict(testX) 

print('Accuracy: {:.2f}'.format(accuracy_score(testY, grid_predictions)))
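For comparison, the same tuning can be sketched with `RandomizedSearchCV`. This is a sketch under assumptions, not necessarily the exact setup behind the numbers in this post: it loads scikit-learn's bundled copy of the dataset instead of `breast_cancer.csv` (note that the bundled copy encodes malignant as 0 and benign as 1), standardizes the features only to keep the fits fast, and uses `n_iter=10` and `random_state=0` as illustrative settings.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for breast_cancer.csv.
X, y = load_breast_cancer(return_X_y=True)
trainX, testX, trainY, testY = train_test_split(X, y, test_size=0.2, random_state=0)

# Assumption: standardizing is added here only to keep the linear SVC fits fast.
scaler = StandardScaler().fit(trainX)
trainX, testX = scaler.transform(trainX), scaler.transform(testX)

# Same candidate values as in the grid search above.
param_distributions = {'C': [0.1, 1, 10, 100, 1000],
                       'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                       'kernel': ['linear']}

start = time.time()

# n_iter=10 is an assumed budget: only 10 of the 25 combinations are tried.
rand_search = RandomizedSearchCV(SVC(), param_distributions, n_iter=10,
                                 cv=5, scoring='accuracy', random_state=0)
rand_search.fit(trainX, trainY)

end = time.time()
print('Calculation time: ' + str(round(end - start, 2)) + ' seconds')

print(rand_search.best_params_)

rand_accuracy = accuracy_score(testY, rand_search.predict(testX))
print('Accuracy: {:.2f}'.format(rand_accuracy))
```

Because every entry in `param_distributions` is a list, scikit-learn draws the combinations without replacement, so no combination is evaluated twice. Continuous distributions (for example from `scipy.stats`) can be passed instead of lists when the interesting values are not known in advance.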

7 Conclusion

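As we can see, Randomized Search took less than half the time that Grid Search needed. A different value was found for gamma, but the prediction accuracy remained the same (95%). This is not surprising: SVC only uses gamma for the 'rbf', 'poly' and 'sigmoid' kernels, so with kernel='linear' the gamma value does not affect the fitted model.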
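The "less than half the time" result is roughly what a simple count of model fits would suggest. The sketch below assumes that Randomized Search was run with scikit-learn's default of `n_iter=10` and with the same `cv=5` as the grid search; both values are assumptions, and the single final refit on the full training set is left out.

```python
# Grid search: every combination of C, gamma and kernel, each fitted once per fold.
n_C, n_gamma, n_kernel = 5, 5, 1
cv = 5
grid_fits = n_C * n_gamma * n_kernel * cv

# Randomized search: a fixed number of sampled combinations per fold.
n_iter = 10  # assumption: scikit-learn's default for RandomizedSearchCV
random_fits = n_iter * cv

ratio = random_fits / grid_fits
print('Grid Search fits:', grid_fits)
print('Randomized Search fits:', random_fits)
print('Share of the work: {:.0%}'.format(ratio))
```

Under these assumptions Randomized Search performs 40% of the fits. Actual run times will not match this ratio exactly, because fits with a large C usually take longer than fits with a small one.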