1 Introduction
In my last publication on “Grid Search” I showed how to do hyper parameter tuning. As you saw in the last chapter (6.3 Grid Search with more than one estimator), these calculations quickly become very computationally intensive. This sometimes leads to very long calculation times. Randomized Search is a cheap alternative to grid search. How Randomized Search work in detail I will show in this publication.
For this post the dataset Breast Cancer Wisconsin (Diagnostic) from the statistic platform “Kaggle” was used. You can download it from my GitHub Repository.
2 Grid Search vs. Randomized Search
First of all, let’s clarify the difference between Grid Search and Randomized Search.
Grid Search can be thought of as an exhaustive search for selecting a machine learning model. With Grid Search, the data scientist/analyst sets up a grid of hyperparameter values and for each combination, trains a model and scores on the testing data. In this approach, every combination of hyperparameter values is tried. This could be very inefficient and computationally intensive.
By contrast, Randomized Search sets up a grid of hyperparameter values and selects random combinations to train the model and score. This allows you to explicitly control the number of parameter combinations that are attempted. The number of search iterations is set based on time or resources. While it’s possible that Randomized Search will not find as accurate of a result as Grid Search, it surprisingly picks the best result more often than not and in a fraction of the time it takes Grid Search would have taken. Given the same resources, Randomized Search can even outperform Grid Search.
3 Loading the libraries and data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import time
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
cancer = pd.read_csv("breast_cancer.csv")
4 Data pre-processing
For this post I use the same data set and the same preparation as for “Grid Search”. I will therefore not go into much detail about the first steps. If you want to learn more about the respective pre-processing steps, please read my “Grid Search - Post”.
vals_to_replace = {'B':'0', 'M':'1'}
cancer['diagnosis'] = cancer['diagnosis'].map(vals_to_replace)
cancer['diagnosis'] = cancer.diagnosis.astype('int64')
x = cancer.drop(['id', 'diagnosis', 'Unnamed: 32'], axis=1)
y = cancer['diagnosis']
trainX, testX, trainY, testY = train_test_split(x, y, test_size = 0.2)
So let’s do our first prediction with a “Support Vector Machine”:
clf = SVC(kernel='linear')
clf.fit(trainX, trainY)
y_pred = clf.predict(testX)
print('Accuracy: {:.2f}'.format(accuracy_score(testY, y_pred)))
5 Grid Searach
Now we are going to do hyperparameter tuning with grid search. We also measure the time how long this tuning takes.
param_grid = {'C': [0.1, 1, 10, 100, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'kernel': ['linear']}
start = time.time()
grid = GridSearchCV(SVC(), param_grid, cv = 5, scoring='accuracy')
grid.fit(trainX, trainY)
end = time.time()
print()
print('Calculation time: ' + str(round(end - start,2)) + ' seconds')
print(grid.best_params_)
grid_predictions = grid.predict(testX)
print('Accuracy: {:.2f}'.format(accuracy_score(testY, grid_predictions)))
6 Randomized Search
Now we do the same hyperparameter tuning with Randomized Search.
param_rand_search = {'C': [0.1, 1, 10, 100, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'kernel': ['linear']}
start = time.time()
rand_search = RandomizedSearchCV(SVC(), param_rand_search, cv = 5, scoring='accuracy')
rand_search.fit(trainX, trainY)
end = time.time()
print()
print('Calculation time: ' + str(round(end - start,2)) + ' seconds')
print(rand_search.best_params_)
grand_search_predictions = rand_search.predict(testX)
print('Accuracy: {:.2f}'.format(accuracy_score(testY, grand_search_predictions)))
7 Conclusion
As we can see, Randomized Search took less than half the time it took to search with Grid Search. A different value was found for gamma, but the prediction accuracy remained the same (95%).