1 Introduction
As announced in my last post, I will now create a neural network using a Deep Learning library (Keras in this case) to solve binary classification problems.
For this post, the Winequality dataset from the statistics platform “Kaggle” was used. You can download it from my “GitHub Repository”.
2 Loading the libraries
import pandas as pd
import os
import shutil
import pickle as pk
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras import models
from keras import layers
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.models import load_model
3 Loading the data
df = pd.read_csv('winequality.csv').dropna()
df
df['type'].value_counts()
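The value counts already hint at the class imbalance I come back to in the conclusion. If you prefer the distribution as proportions, a minimal variant looks like this:
# Relative class frequencies instead of absolute counts
df['type'].value_counts(normalize=True)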
4 Data pre-processing
4.1 Determination of the predictors and the criterion
x = df.drop('type', axis=1)
y = df['type']
4.2 Encoding
Since all variables must be numeric, we have to recode the criterion at this point. For this I used the LabelEncoder. How to use it is explained in the following post: Types of Encoder
encoder = LabelEncoder()
encoded_Y = encoder.fit_transform(y)
encoded_Y
4.3 Train-Validation-Test Split
As already known from the computer vision posts, for neural networks we need to split our dataset into a training part, a validation part and a testing part. In the following, I will randomly assign 70% of the data to the training part and 15% each to the validation and test part.
train_ratio = 0.70
validation_ratio = 0.15
test_ratio = 0.15
# First split: separate the training data from the rest
trainX, testX, trainY, testY = train_test_split(x, encoded_Y, test_size=1 - train_ratio)
# Second split: divide the rest into validation and test data
valX, testX, valY, testY = train_test_split(testX, testY, test_size=test_ratio/(test_ratio + validation_ratio))
print(trainX.shape)
print(valX.shape)
print(testX.shape)
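Since the two wine types are unevenly distributed, a possible variant (not used here) is to pass stratify to train_test_split so that all splits keep the original class ratios:
# Sketch: stratified splits that preserve the class ratios (not used above)
trainX, testX, trainY, testY = train_test_split(x, encoded_Y, test_size=1 - train_ratio, stratify=encoded_Y)
valX, testX, valY, testY = train_test_split(testX, testY, test_size=test_ratio/(test_ratio + validation_ratio), stratify=testY)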
5 ANN for binary Classification
My approach to using neural networks with Keras is described in detail in my post Computer Vision - Convolutional Neural Network and can be consulted there if anything is unclear.
5.1 Name Definitions
checkpoint_no = 'ckpt_1_ANN'
model_name = 'Wine_ANN_2FC_F16_16_epoch_25'
5.2 Parameter Settings
input_shape = trainX.shape[1]
n_batch_size = 100
n_steps_per_epoch = int(trainX.shape[0] / n_batch_size)
n_validation_steps = int(valX.shape[0] / n_batch_size)
n_test_steps = int(testX.shape[0] / n_batch_size)
n_epochs = 25
print('Input Shape: ' + str(input_shape))
print('Batch Size: ' + str(n_batch_size))
print()
print('Steps per Epoch: ' + str(n_steps_per_epoch))
print()
print('Validation Steps: ' + str(n_validation_steps))
print('Test Steps: ' + str(n_test_steps))
print()
print('Number of Epochs: ' + str(n_epochs))
5.3 Layer Structure
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(input_shape,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.summary()
5.4 Configuring the model for training
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
5.5 Callbacks
If you want to know more about callbacks, you can read about them here at Keras or in my post about Convolutional Neural Networks.
# Prepare a directory to store all the checkpoints.
checkpoint_dir = './'+ checkpoint_no
if not os.path.exists(checkpoint_dir):
    os.makedirs(checkpoint_dir)

keras_callbacks = [ModelCheckpoint(filepath=checkpoint_dir + '/' + model_name,
                                   monitor='val_loss', save_best_only=True, mode='auto')]
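EarlyStopping was imported above but is not used in this run. A minimal sketch of how it could be added to the callback list; the patience value of 5 is only an assumption:
# Sketch: additionally stop training if val_loss has not improved for 5 epochs
keras_callbacks = [ModelCheckpoint(filepath=checkpoint_dir + '/' + model_name,
                                   monitor='val_loss', save_best_only=True, mode='auto'),
                   EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)]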
5.6 Fitting the model
history = model.fit(trainX,
                    trainY,
                    steps_per_epoch=n_steps_per_epoch,
                    epochs=n_epochs,
                    batch_size=n_batch_size,
                    validation_data=(valX, valY),
                    validation_steps=n_validation_steps,
                    callbacks=keras_callbacks)
5.7 Obtaining the best model values
hist_df = pd.DataFrame(history.history)
hist_df['epoch'] = hist_df.index + 1
cols = list(hist_df.columns)
cols = [cols[-1]] + cols[:-1]
hist_df = hist_df[cols]
hist_df.to_csv(checkpoint_no + '/' + 'history_df_' + model_name + '.csv')
hist_df.head()
values_of_best_model = hist_df[hist_df.val_loss == hist_df.val_loss.min()]
values_of_best_model
5.8 Obtaining class assignments
Similar to the neural networks for computer vision, I also save the class assignments for later reuse.
class_assignment = dict(zip(y, encoded_Y))
df_temp = pd.DataFrame([class_assignment], columns=class_assignment.keys())
df_temp = df_temp.stack()
df_temp = pd.DataFrame(df_temp).reset_index().drop(['level_0'], axis=1)
df_temp.columns = ['Category', 'Allocated Number']
df_temp.to_csv(checkpoint_no + '/' + 'class_assignment_df_' + model_name + '.csv')
print('Class assignment:', str(class_assignment))
The encoder used is also saved.
pk.dump(encoder, open(checkpoint_no + '/' + 'encoder.pkl', 'wb'))
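A short sketch of how the saved encoder could be reloaded later and used to translate the numeric codes back into the original labels:
# Reload the pickled encoder and map the encoded values back to the original classes
reloaded_encoder = pk.load(open(checkpoint_no + '/' + 'encoder.pkl', 'rb'))
reloaded_encoder.inverse_transform([0, 1])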
5.9 Validation
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
5.10 Load best model
Again, I refer to the Computer Vision posts, where I explained why and how I clean up the model checkpoint folders.
# Loading the automatically saved model
model_reloaded = load_model(checkpoint_no + '/' + model_name)
# Saving the best model in the correct path and format
root_directory = os.getcwd()
checkpoint_dir = os.path.join(root_directory, checkpoint_no)
model_name_temp = os.path.join(checkpoint_dir, model_name + '.h5')
model_reloaded.save(model_name_temp)
# Deletion of the automatically created folder under Model Checkpoint File.
folder_name_temp = os.path.join(checkpoint_dir, model_name)
shutil.rmtree(folder_name_temp, ignore_errors=True)
best_model = load_model(model_name_temp)
The overall folder structure should look like this:
5.11 Model Testing
test_loss, test_acc = best_model.evaluate(testX,
                                          testY,
                                          steps=n_test_steps)
print()
print('Test Accuracy:', test_acc)
5.12 Predictions
y_pred = best_model.predict(testX)
y_pred
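The sigmoid output layer returns probabilities between 0 and 1. A minimal sketch of how these could be turned into class labels (using the common threshold of 0.5) and mapped back to the original categories via the encoder:
# Convert the predicted probabilities into class labels and back into category names
y_pred_classes = (y_pred > 0.5).astype('int32').flatten()
encoder.inverse_transform(y_pred_classes)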
6 Prevent Overfitting
Overfitting is a common problem in practice. For this reason, I present a few approaches here on how to counteract it.
6.1 Original Layer Structure
Here again, as a reminder, is the layer structure used above:
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(input_shape,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
6.2 Reduce the network’s size
The first thing I always try is to change something in the layer structure. To counteract overfitting, it is often advisable to reduce the network's size. For our example, I would try the following smaller layer structure if overfitting occurred.
model = models.Sequential()
model.add(layers.Dense(4, activation='relu', input_shape=(input_shape,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
6.3 Adding weight regularization
Another option is Weight Regularization:
from keras import regularizers
model = models.Sequential()
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu', input_shape=(input_shape,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
6.4 Adding dropout
As in the Computer Vision posts, adding dropout layers is also a very useful option.
An example layer structure in our case would look like this:
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(input_shape,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
7 Conclusion
Lastly, I would like to mention a few points regarding this post. It was not necessary for this dataset, but with real-world data it usually is: further artifacts should be stored alongside the model (a short sketch follows the list):
- Mean values of the individual predictors, in order to be able to impute missing values later on.
- Further encoders for the predictors, if categorical features are converted.
- Scalers, if any are used.
- If variables were excluded, a list of the final features should also be stored.
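A minimal sketch of how such additional artifacts could be stored next to the model checkpoint; the scaler and the file names are only assumptions and were not part of the training above:
# Sketch: persist imputation values and the final feature list next to the checkpoint
feature_means = x.mean()
feature_means.to_csv(checkpoint_no + '/' + 'feature_means.csv')
# scaler = StandardScaler().fit(trainX)   # only if a scaler had been used
# pk.dump(scaler, open(checkpoint_no + '/' + 'scaler.pkl', 'wb'))
final_features = list(x.columns)
pk.dump(final_features, open(checkpoint_no + '/' + 'final_features.pkl', 'wb'))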
The reasons for these recommendations can be found in my Data Science post, where I have also compiled best-practice guidelines on how to proceed with model training.
I would like to add one limitation at this point. You may have noticed it already: the dataset is heavily imbalanced. I have explained how to deal with such problems here: Dealing with imbalanced classes
References
The content of the entire post was created using the following sources:
Chollet, F. (2018). Deep Learning with Python. Shelter Island, NY: Manning Publications.