
Principal Component Analysis (PCA)

1 Introduction

After the various methods of cluster analysis have been presented in previous publications (see “Cluster Analysis”), we now come to the second category in the area of unsupervised machine learning: dimensionality reduction.

The areas of application of dimensionality reduction are spread widely across machine learning. Here are some of them:

  • Pre-processing / Feature engineering
  • Noise reduction
  • Generating plausible artificial datasets
  • Financial modelling / risk analysis

For this post the datasets Auto-mpg and MNIST from the statistics platform “Kaggle” were used. A copy of the datasets is available in my “GitHub Repository” and here: https://drive.google.com/open?id=1Bfquk0uKnh6B3Yjh2N87qh0QcmLokrVk.

2 Loading the libraries

import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Chapter 4&5
from sklearn.decomposition import PCA

# Chapter 6
from sklearn.decomposition import IncrementalPCA

# Chapter 7
from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_swiss_roll

from mpl_toolkits.mplot3d import Axes3D
from pydiffmap import diffusion_map as dm
from pydiffmap.visualization import data_plot

# Chapter 8
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

3 Introducing PCA

PCA is a commonly used and very effective dimensionality reduction technique, which often forms a pre-processing stage for a number of machine learning models and techniques.

In a nutshell: PCA reduces the dimensionality of a dataset by decomposing the data into a series of components, where each component represents a source of variance within the data.

As its name suggests, the first principal component produced by PCA comprises the majority of the information, or variance, within the data. With each subsequent component, less information but more subtlety is contributed to the compressed representation.
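To make this concrete, here is a minimal sketch on a small synthetic dataset (the variables rng, latent and X_toy are purely illustrative and not part of the datasets used later): the explained variance ratios reported by scikit-learn are always sorted in descending order, so the first component carries the largest share.

rng = np.random.RandomState(0)
latent = rng.normal(size=(200, 2))
# five correlated columns built from two underlying factors
X_toy = np.hstack([latent, latent @ rng.normal(size=(2, 3))])

pca_toy = PCA().fit(X_toy)
print(pca_toy.explained_variance_ratio_)
# descending ratios: the first component explains the most variance,
# each subsequent component explains less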

This post is intended to serve as a general introduction to PCA. Two further publications will present the two main uses of PCA separately and in detail:

  • PCA for visualization
  • PCA for speeding up machine learning algorithms

4 PCA in general

For the demonstration of PCA in general we’ll load the cars dataset:

cars = pd.read_csv('auto-mpg.csv')
cars["horsepower"] = pd.to_numeric(cars.horsepower, errors='coerce')
cars

For further use, we drop all rows with missing values from the dataset, separate the predictors from the car names, and standardize the predictors.

cars=cars.dropna()
X = cars.drop(['car name'], axis=1)
Y = cars['car name']
sc = StandardScaler()

X = X.values
X_std =  sc.fit_transform(X)  

With the following calculations we can see how much variance the individual principal components explain within the dataset.

# covariance matrix of the standardized predictors
cov_matrix = np.cov(X_std.T)

# eigen-decomposition: the eigenvalues measure the variance along each principal axis
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# share of the total variance explained by each component (sorted in descending order)
tot = sum(eigenvalues)
var_explained = [(i / tot) for i in sorted(eigenvalues, reverse=True)]
cum_var_exp = np.cumsum(var_explained)

plt.bar(range(1,len(var_explained)+1), var_explained, alpha=0.5, align='center', label='individual explained variance')
plt.step(range(1,len(var_explained)+1),cum_var_exp, where= 'mid', label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc = 'best')
plt.show()

As we can see, it is worth using the first two principal components, as together they already explain around 80% of the variance.

pca = PCA(n_components = 2)
pca.fit(X_std)
x_pca = pca.transform(X_std)
pca.explained_variance_ratio_

For those who are too lazy to add that up in their heads:

pca.explained_variance_ratio_.sum()

In the previous step, we specified how many principal components the PCA should calculate and then asked how much variance these components explain.

We can also approach this process the other way around and tell the PCA how much variance we would like to have explained.

We do this as follows (here for 95% explained variance):

pca = PCA(n_components = 0.95)
pca.fit(X_std)
x_pca = pca.transform(X_std)
pca.n_components_

Four principal components were necessary to explain 95% of the variance.
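To get a feeling for what keeping 95% of the variance means, we can (as a small sketch using the pca object fitted just above) map the reduced data back to the original feature space with inverse_transform and measure the reconstruction error:

X_reconstructed = pca.inverse_transform(x_pca)

# mean squared difference between the standardized originals and their reconstruction;
# the smaller this value, the less information was lost by the compression
print('Mean squared reconstruction error: ' + str(round(np.mean((X_std - X_reconstructed) ** 2), 4)))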

5 Randomized PCA

PCA is mostly used for very large datasets with many variables in order to make them clearer and easier to interpret. This can require a lot of computing power and lead to long waiting times. Randomized PCA can be used to reduce the calculation time: simply set the parameter svd_solver to ‘randomized’. The following example, carried out with the MNIST dataset, shows the saving in computing time.

mnist = pd.read_csv('mnist_train.csv')
X = mnist.drop(['label'], axis=1)
sc = StandardScaler()
X = X.values
X_std =  sc.fit_transform(X)
print(X_std.shape)

import time

start = time.time()

# standard PCA (svd_solver left at its default)
pca = PCA(n_components = 200)
pca.fit(X_std)
x_pca = pca.transform(X_std)


end = time.time()
print()
print('Calculation time: ' + str(end - start) + ' seconds')

pca_time = end - start

start = time.time()

# PCA with the randomized SVD solver selected explicitly
rnd_pca = PCA(n_components = 200, svd_solver='randomized')
rnd_pca.fit(X_std)
x_rnd_pca = rnd_pca.transform(X_std)

end = time.time()
print()
print('Calculation time: ' + str(end - start) + ' seconds')

rnd_pca_time = end - start
diff = pca_time - rnd_pca_time
diff

percentage_decrease = ((pca_time - rnd_pca_time) / pca_time) * 100
print('Percentage decrease of: ' + str(round(percentage_decrease, 2)) + '%')
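The randomized solver is an approximation, so it is worth a quick sanity check that hardly any explained variance is lost compared to the model fitted without the parameter:

# both sums should be almost identical if the approximation is good
print('Default solver:    ' + str(pca.explained_variance_ratio_.sum()))
print('Randomized solver: ' + str(rnd_pca.explained_variance_ratio_.sum()))

Note that, depending on the scikit-learn version and the data size, the default svd_solver='auto' may itself fall back to the randomized solver, so the measured time difference can vary from run to run.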

6 Incremental PCA

In some cases the dataset may be too large to perform a principal component analysis all at once. Incremental PCA is available for these cases.

With n_batches we determine how many batches the data is split into, i.e. how much data is processed at a time.

n_batches = 100

inc_pca = IncrementalPCA(n_components = 100)
# feed the data to the PCA one batch at a time
for X_batch in np.array_split(X_std, n_batches):
    inc_pca.partial_fit(X_batch)
inc_pca.explained_variance_ratio_.sum()
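If the data fits into memory and we only want the batch-wise processing, IncrementalPCA can also do the splitting itself: passing batch_size to the constructor and calling fit() behaves essentially like the manual partial_fit loop above (a minimal sketch; the variable name inc_pca_auto is just for illustration):

# let IncrementalPCA split the data into batches internally
inc_pca_auto = IncrementalPCA(n_components = 100, batch_size = X_std.shape[0] // n_batches)
inc_pca_auto.fit(X_std)
inc_pca_auto.explained_variance_ratio_.sum()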

7 Kernel PCA

Now we know all too well from practice that some data is not linearly separable. Like this one, for example:

X, color = make_swiss_roll(n_samples = 1000)
swiss_roll = X
swiss_roll.shape
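Since the swiss roll lives in three dimensions, it helps to look at it in 3D first (a quick sketch using the Axes3D import from the libraries section):

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(swiss_roll[:, 0], swiss_roll[:, 1], swiss_roll[:, 2], c=color, cmap=plt.cm.Spectral)
plt.show()

Projected onto only the first two coordinates, however, the roll structure is hard to see: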

plt.scatter(swiss_roll[:, 0], swiss_roll[:, 1], c=color)

Not really nice to look at. But we can do better.

neighbor_params = {'n_jobs': -1, 'algorithm': 'ball_tree'}

mydmap = dm.DiffusionMap.from_sklearn(n_evecs=2, k=200, epsilon='bgh', alpha=1.0, neighbor_params=neighbor_params)
# fit to data and return the diffusion map.
dmap = mydmap.fit_transform(swiss_roll)

data_plot(mydmap, dim=3, scatter_kwargs = {'cmap': 'Spectral'})
plt.show()

Let’s use some PCA models with different kernels and have a look at how they will perform.

linear_pca = KernelPCA(n_components = 2, kernel='linear')

linear_pca.fit(swiss_roll)

X_reduced_linear = linear_pca.transform(swiss_roll)

X_reduced_linear.shape

plt.scatter(X_reduced_linear[:, 0], X_reduced_linear[:, 1], c = color, cmap=plt.cm.Spectral)
plt.title("Linear kernel")

rbf_pca = KernelPCA(n_components = 2, kernel='rbf', gamma=0.04)

rbf_pca.fit(swiss_roll)

X_reduced_rbf = rbf_pca.transform(swiss_roll)

X_reduced_rbf.shape

plt.scatter(X_reduced_rbf[:, 0], X_reduced_rbf[:, 1], c = color, cmap=plt.cm.Spectral)
plt.title("RBF kernel, gamma: 0.04")

sigmoid_pca = KernelPCA(n_components = 2, kernel='sigmoid', gamma=0.001)

sigmoid_pca.fit(swiss_roll)

X_reduced_sigmoid = sigmoid_pca.transform(swiss_roll)

X_reduced_sigmoid.shape

plt.scatter(X_reduced_sigmoid[:, 0], X_reduced_sigmoid[:, 1], c = color, cmap=plt.cm.Spectral)
plt.title("RBF kernel, gamma: 0.04")

8 Tuning Hyperparameters

For this chapter let’s create some new test data:

Xmoon, ymoon = make_moons(500, noise=.06)
plt.scatter(Xmoon[:, 0], Xmoon[:, 1])

If we use kernel PCA as a preprocessing step to train a machine learning algorithm, the most suitable kernel (and gamma) can also be found with the help of grid search.

x = Xmoon
y = ymoon

trainX, testX, trainY, testY = train_test_split(x, y, test_size = 0.2)
clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression())
    ])
param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["linear", "rbf", "sigmoid"]
    }]
grid_search = GridSearchCV(clf, param_grid, cv=3)

grid_search.fit(trainX, trainY)

Here are the best parameters to use:

print(grid_search.best_params_)
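To see whether the tuned pipeline generalizes, we can evaluate it on the held-out test data (GridSearchCV refits the best pipeline on the whole training set by default, so grid_search can be scored directly):

# accuracy of the best kernel-PCA + logistic regression pipeline on unseen data
print('Test accuracy: ' + str(round(grid_search.score(testX, testY), 3)))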

9 Conclusion

In this post I explained what PCA is in general and how to use it. I also presented different variants of PCA for different situations.