PCA for Visualization

1 Introduction

After covering “Principal Component Analysis” in detail in my last publication, we now come to one of its two main uses announced there: PCA for visualization.

This post uses the Pokemon dataset from the data science platform “Kaggle”. You can download it from my “GitHub Repository”.

2 Loading the libraries and the dataset

import numpy as np
import pandas as pd

import math

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler
pokemon = pd.read_csv("pokemon.csv")
pokemon.head()

3 Statistics and preprocessing

df = pokemon[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]

df.describe()

col_names = df.columns
features = df[col_names]

scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
df_scaled = pd.DataFrame(features, columns = col_names)
df_scaled.head()
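
As a quick sanity check on what the scaling step does, the sketch below standardizes a small synthetic table and confirms that every column ends up with mean 0 and standard deviation 1. The toy values and the names `toy`, `toy_scaled`, `col_means` and `col_stds` are assumptions for illustration, not part of the Pokemon data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for two of the stat columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "HP": rng.normal(70, 25, size=200),
    "Speed": rng.normal(65, 30, size=200),
})

toy_scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy.values), columns=toy.columns
)

# StandardScaler divides by the population standard deviation (ddof=0)
col_means = toy_scaled.mean()
col_stds = toy_scaled.std(ddof=0)
print(col_means.round(6))
print(col_stds.round(6))
```

Without this step, features measured on larger scales would dominate the principal components simply because of their units.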

4 PCA for visualization

First, we compute the first two principal components with PCA. If any of the following steps is unclear or insufficiently explained, see “this” post of mine.

pca = PCA(n_components=2, svd_solver='full')

pca.fit(df_scaled)
T = pca.transform(df_scaled)
print('Here we can see that we started with 6 dimensions:')
print(df_scaled.shape)
print('')
print('After PCA we have only 2:')
print(T.shape)
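
To make the transform step less of a black box: with the default settings, projecting onto the components amounts to centering the data and taking dot products with each component vector. The following is a minimal sketch on random data; `X_demo`, `demo_pca`, `T_sklearn` and `T_manual` are names made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data with 6 features, like the scaled stat table
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(100, 6))

demo_pca = PCA(n_components=2, svd_solver="full").fit(X_demo)
T_sklearn = demo_pca.transform(X_demo)

# Manual projection: center the data, then multiply by the components
T_manual = (X_demo - demo_pca.mean_) @ demo_pca.components_.T

print(T_sklearn.shape)                    # (100, 2)
print(np.allclose(T_sklearn, T_manual))   # True
```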

pca.explained_variance_ratio_
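
The explained variance ratio shows how much of the total variance each component keeps, so the sum of the first two entries indicates how much information the 2D plot retains. A small sketch of how one might inspect this cumulatively, assuming synthetic data (`X_var`, `pca_all`, `ratios` and `cumulative` are illustrative names):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 6 features driven mostly by 2 hidden factors
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X_var = latent @ mixing + 0.3 * rng.normal(size=(300, 6))

# Keep all components so the ratios add up to 1
pca_all = PCA(svd_solver="full").fit(X_var)
ratios = pca_all.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for k, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{k}: {r:.3f} (cumulative {c:.3f})")
```

If the cumulative value after two components is low, the scatter plot hides a large part of the structure and should be read with caution.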

4.1 Interpreting Components

These are the two principal components that were computed:

components = pd.DataFrame(pca.components_, columns = df_scaled.columns, index=[1, 2])
components

Personally, I prefer to read these in the following format:

components = components.T

components.columns = ['Principle_Component_1', 'Principle_Component_2']
components

Component 1

pc1 = components[['Principle_Component_1']].sort_values(by='Principle_Component_1', ascending=False)
pc1

For the first principal component, Sp. Atk and Sp. Def have the largest loadings, so this component correlates well with both of them. Pokemon with a high value on the first principal component therefore tend to have high Sp. Atk and Sp. Def.
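
One way to check a reading like this is to correlate the original features with the component scores directly. The sketch below does that on synthetic data in which two features share a common factor and a third is pure noise. All names (`X_toy`, `toy_pca`, `toy_scores`, `corr_pc1`, `approx`) are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features share a common factor, the third is noise
rng = np.random.default_rng(1)
shared = rng.normal(size=(500, 1))
X_toy = np.hstack([
    shared + 0.2 * rng.normal(size=(500, 1)),
    shared + 0.2 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])
X_toy = StandardScaler().fit_transform(X_toy)

toy_pca = PCA(n_components=2, svd_solver="full").fit(X_toy)
toy_scores = toy_pca.transform(X_toy)

# Correlation between each original feature and the PC1 scores
corr_pc1 = np.array([
    np.corrcoef(X_toy[:, j], toy_scores[:, 0])[0, 1]
    for j in range(X_toy.shape[1])
])

# For standardized inputs this is roughly loading * sqrt(variance of PC1)
approx = toy_pca.components_[0] * np.sqrt(toy_pca.explained_variance_[0])

print(np.round(corr_pc1, 3))
print(np.round(approx, 3))
```

Note that the sign of a component is arbitrary, so it is the magnitude and relative sign of the loadings that carry the interpretation.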

Component 2

pc2 = components[['Principle_Component_2']].sort_values(by='Principle_Component_2', ascending=False)
pc2

Be careful at this point. Some loadings are large but negative and therefore end up at the bottom of the table. To rank the features by the strength of their contribution, we convert all values to absolute values and then sort them in descending order.

pc2['positive_values'] = abs(pc2.Principle_Component_2)
pc2

pc2.sort_values(by='positive_values', ascending=False)
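
If you would rather not add a helper column, the same ordering can be obtained in one step by sorting the absolute values and reusing the resulting index. This is only a sketch with made-up loadings, not the values from the Pokemon data.

```python
import pandas as pd

# Hypothetical loadings, for illustration only
loadings = pd.DataFrame(
    {"Principle_Component_2": [0.1, -0.6, 0.5, -0.2]},
    index=["HP", "Speed", "Defense", "Attack"],
)

# Order the rows by absolute value while keeping the original signs
order = (
    loadings["Principle_Component_2"]
    .abs()
    .sort_values(ascending=False)
    .index
)
sorted_loadings = loadings.loc[order]
print(sorted_loadings)
```

The signs stay visible in the result, which matters because they tell you in which direction each feature moves the component.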

The second principal component increases with a decrease in Speed and an increase in Defense. Pokemon with a high value on the second principal component therefore tend to have high Defense but low Speed.

4.2 Visualization of the components

def get_important_features(transformed_features, components_, columns):
    """
    This function will return the most "important"
    features so we can determine which have the most
    effect on the first two principal components
    """
    num_columns = len(columns)

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:,0])
    yvector = components_[1] * max(transformed_features[:,1])

    # Compute the length of each feature's (x, y) vector and sort by it.
    # These are your *original* columns, not the principal components.
    important_features = { columns[i] : math.sqrt(xvector[i]**2 + yvector[i]**2) for i in range(num_columns) }
    important_features = sorted(zip(important_features.values(), important_features.keys()), reverse=True)
    return important_features

# Apply the function
important_features = get_important_features(T, pca.components_, df.columns.values)

# convert output to pd.dataframe
important_features = pd.DataFrame(important_features, columns =['Value', 'Feature'])
# change order of dataframe columns
cols = ['Feature', 'Value']
important_features = important_features[cols]

#print the output
important_features
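
The ranking above is just the length of each feature's arrow in the PC1/PC2 plane. As a sketch of the same idea in vectorized form, the hypothetical helper `feature_vector_lengths` below uses `np.hypot` and returns a sorted pandas Series. The toy scores and components are made up so the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

def feature_vector_lengths(scores, components, columns):
    """Length of each original feature's arrow in the PC1/PC2 plane."""
    # Same scaling as above: stretch each component by the largest score on it
    x = components[0] * scores[:, 0].max()
    y = components[1] * scores[:, 1].max()
    lengths = pd.Series(np.hypot(x, y), index=columns, name="Value")
    return lengths.sort_values(ascending=False)

# Hypothetical inputs, for illustration only
toy_scores = np.array([[2.0, 1.0], [-1.0, 0.5], [0.5, -1.5]])
toy_components = np.array([[0.6, 0.8, 0.0], [0.0, 0.6, 0.8]])

result = feature_vector_lengths(toy_scores, toy_components, ["a", "b", "c"])
print(result)
```

Here feature `b` comes out on top because it contributes to both components, while `c` only contributes to the second one.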

def draw_vectors(transformed_features, components_, columns):
    """
    This function will project your *original* features
    onto your principal component feature space, so that you can
    visualize how "important" each one was for the
    two plotted components
    """

    num_columns = len(columns)

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:,0])
    yvector = components_[1] * max(transformed_features[:,1])

    ax = plt.axes()

    for i in range(num_columns):
        # Use an arrow to project each original feature as a
        # labeled vector on your principal component axes
        plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
        plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75)

    return ax

ax = draw_vectors(T, pca.components_, df.columns.values)
T_df = pd.DataFrame(T)
T_df.columns = ['component1', 'component2']

T_df['color'] = 'y'
T_df.loc[T_df['component1'] > 4, 'color'] = 'g'
T_df.loc[T_df['component2'] > 3, 'color'] = 'r'

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.scatter(T_df['component1'], T_df['component2'], color=T_df['color'], alpha=0.5)
plt.show()

Get the Pokemon that score high on the first principal component (high Sp. Atk, high Sp. Def):

pc1_pokemon = pokemon.loc[T_df[T_df['color'] == 'g'].index]
pc1_pokemon

Get the Pokemon that score high on the second principal component (high Defense, low Speed):

pc2_pokemon = pokemon.loc[T_df[T_df['color'] == 'r'].index]
pc2_pokemon
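
One detail worth knowing: because the red assignment runs after the green one, a point that exceeds both thresholds ends up only in the red group. If that matters for your analysis, boolean masks on the scores select each group independently. A minimal sketch with made-up names and scores (`toy_pokemon` and `toy_T` are hypothetical stand-ins for `pokemon` and `T_df`):

```python
import pandas as pd

# Hypothetical names and scores, for illustration only
toy_pokemon = pd.DataFrame({"Name": ["A", "B", "C", "D"]})
toy_T = pd.DataFrame({
    "component1": [5.0, 0.1, 4.5, -2.0],
    "component2": [0.0, 3.5, 3.2, 1.0],
})

# Each mask is evaluated on its own, so row "C" appears in both groups
high_pc1 = toy_pokemon[(toy_T["component1"] > 4).to_numpy()]
high_pc2 = toy_pokemon[(toy_T["component2"] > 3).to_numpy()]

print(high_pc1["Name"].tolist())
print(high_pc2["Name"].tolist())
```

This assumes both tables have the same row order, which holds in this post because the scores were computed from the Pokemon table without reordering or dropping rows.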

5 Conclusion

In this post, I showed how PCA can be used to visualize high-dimensional data in two dimensions and how the component loadings help to interpret what the plot shows.