1 Introduction
After having reported in great detail on the different machine learning areas in numerous posts, I will now turn to various analytics fields.
I start with Marketing Analytics.
To be precise: the analysis of conversion rates, their influencing factors, and how machine learning algorithms can be used to generate valuable insights from this kind of data.
In this post I will use the data sets ‘bank-additional-full’ (from the “UCI Machine Learning Repository”) and ‘WA_Fn-UseC_-Marketing-Customer-Value-Analysis’. You can also download both from my “GitHub Repository”.
2 Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import OneHotEncoder
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, auc
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
3 Descriptive Analytics (Conversion Rate)
Definition Conversion Rate:
The conversion rate describes the ratio of conversions achieved to visits/clicks. A conversion is the transformation of a prospect into a customer or buyer, for example through a purchase or a download.
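As a quick sketch with made-up numbers (not taken from the data set used below), the calculation looks like this:
# Hypothetical example: 50 conversions out of 1,000 visits
visits = 1000
conversions = 50
conversion_rate = conversions / visits * 100.0
print('Conversion Rate: %0.2f%%' % conversion_rate)   # 5.00%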
df = pd.read_csv('bank-additional-full.csv', sep=';')
df.head()
'''
In the following, the column y is encoded.
The newly generated values are then inserted into the original dataframe.
The old column is retained in this case.
'''
encoder_y = LabelBinarizer()
# Application of the LabelBinarizer
y_encoded = encoder_y.fit_transform(df.y.values.reshape(-1,1))
# Insertion of the coded values into the original data set
df['conversion'] = y_encoded
# Get the exact encoding and show the new dataframe
print(encoder_y.classes_)
print('Encoding: no=0, yes=1')
print('-----------------------------')
print()
print('New Data Frame:')
df.head()
'''
Absolute conversions vs. conversion rate
'''
print('Conversions (absolute): %i out of %i' % (df.conversion.sum(), df.shape[0]))
print('Conversion Rate: %0.2f%%' % (df.conversion.sum() / df.shape[0] * 100.0))
Age
'''
Calculate the conversion rate by age
'''
conversion_rate_by_age = df.groupby(by='age')['conversion'].sum() / df.groupby(by='age')['conversion'].count() * 100.0
pd.DataFrame(conversion_rate_by_age.reset_index().rename(columns={'conversion':'conversion_%'})).head()
ax = conversion_rate_by_age.plot(
grid=True,
figsize=(10, 7),
title='Conversion Rates by Age')
ax.set_xlabel('age')
ax.set_ylabel('conversion rate (%)')
plt.show()
# Assign each age to an age group (applied row by row below)
def age_group_function(df):
    if df['age'] >= 70:
        return '70<'
    elif df['age'] >= 60:
        return '[60, 70]'
    elif df['age'] >= 50:
        return '[50, 60]'
    elif df['age'] >= 40:
        return '[40, 50]'
    elif df['age'] >= 30:
        return '[30, 40]'
    elif df['age'] >= 20:
        return '[20, 30]'
    else:
        return '<20'

df['age_group'] = df.apply(age_group_function, axis=1)
df.head()
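As a side note, the same binning could also be done more compactly with pd.cut. This is only a sketch and is not used in the rest of the post; the labels match the age groups defined above.
# Left-closed bins, e.g. ages 60-69 fall into '[60, 70]' and ages >= 70 into '70<'
bins = [0, 20, 30, 40, 50, 60, 70, np.inf]
labels = ['<20', '[20, 30]', '[30, 40]', '[40, 50]', '[50, 60]', '[60, 70]', '70<']
df['age_group_cut'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)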
'''
Calculate the conversion rate by age_group
'''
conversion_rate_by_age_group = df.groupby(by='age_group')['conversion'].sum() / df.groupby(by='age_group')['conversion'].count() * 100.0
pd.DataFrame(conversion_rate_by_age_group.reset_index().rename(columns={'conversion':'conversion_%'}))
ax = conversion_rate_by_age_group.loc[['<20', '[20, 30]', '[30, 40]', '[40, 50]', '[50, 60]', '[60, 70]', '70<']].plot(
kind='bar',
color='skyblue',
grid=True,
figsize=(10, 7),
title='Conversion Rates by Age Groups')
ax.set_xlabel('age_group')
ax.set_ylabel('conversion rate (%)')
plt.show()
Marital Status
conversions_by_marital_status = pd.pivot_table(df, values='y', index='marital', columns='conversion', aggfunc=len)
conversions_by_marital_status.columns = ['non_conversions', 'conversions']
conversions_by_marital_status
conversions_by_marital_status.plot(
kind='pie',
figsize=(15, 7),
startangle=90,
subplots=True,
autopct=lambda x: '%0.1f%%' % x)
plt.show()
Age Groups and Marital Status
age_marital = df.groupby(['age_group', 'marital'])['conversion'].sum().unstack('marital').fillna(0)
age_marital
age_marital = age_marital.divide(
df.groupby(
by='age_group'
)['conversion'].count(),
axis=0)
age_marital
ax = age_marital.loc[
['<20', '[20, 30]', '[30, 40]', '[40, 50]', '[50, 60]', '[60, 70]', '70<']].plot(
kind='bar',
stacked=True,
grid=True,
figsize=(10,7))
ax.set_title('Conversion rates by Age & Marital Status')
ax.set_xlabel('age group')
ax.set_ylabel('conversion rate')
plt.show()
4 Drivers behind Marketing Engagement
Definition Marketing Engagement:
In marketing engagement, the aim is to involve customers in the marketing activities and thus encourage them to interact actively with the content. This should create a positive experience and a positive association with the brand and the company, strengthening customer loyalty. It can foster identification with the company and its values and ultimately increase the chance of conversions.
df = pd.read_csv('WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv')
df['Engaged'] = df['Response'].apply(lambda x: 0 if x == 'No' else 1)
df.head().T
4.1 Select Numerical Columns
num_col = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_columns = list(df.select_dtypes(include=num_col).columns)
df_numeric = df[numerical_columns]
df_numeric.dtypes
4.2 Select and Encode Categorical Columns
obj_col = ['object']
object_columns = list(df.select_dtypes(include=obj_col).columns)
df_categorical = df[object_columns]
df_categorical.dtypes
We only keep three of the categorical variables here. Encoding all of them would take too long, and this section is just meant as an example of how to handle categorical variables.
df_categorical = df_categorical[['State', 'Education', 'Gender']]
df_categorical.head()
print('Values of the variable State:')
print()
print(df_categorical['State'].value_counts())
print('--------------------------------------------')
print('Values of the variable Education:')
print()
print(df_categorical['Education'].value_counts())
print('--------------------------------------------')
print('Values of the variable Gender:')
print()
print(df_categorical['Gender'].value_counts())
Here we have 3 different kinds of categorical variables:
- State: nominal
- Education: ordinal
- Gender: binary
In the following, the column State is encoded. The newly generated values are then inserted into the original dataframe and the old column is deleted.
encoder_State = OneHotEncoder()
# Application of the OneHotEncoder
OHE = encoder_State.fit_transform(df_categorical.State.values.reshape(-1,1)).toarray()
# Conversion of the newly generated data to a dataframe
df_OHE = pd.DataFrame(OHE, columns = ["State_" + str(encoder_State.categories_[0][i])
for i in range(len(encoder_State.categories_[0]))])
# Insertion of the coded values into the original data set
df_categorical = pd.concat([df_categorical, df_OHE], axis=1)
# Delete the original column to avoid duplication
df_categorical = df_categorical.drop(['State'], axis=1)
In the following, the column Education is encoded. The newly generated values are then inserted into the original dataframe and the old column is deleted.
# Create a dictionary how the observations should be coded
education_dict = {'High School or Below' : 0,
'College' : 1,
'Bachelor' : 2,
'Master' : 3,
'Doctor' : 4}
# Map the dictionary onto the column Education and store the results in a new column
df_categorical['Education_encoded'] = df_categorical.Education.map(education_dict)
# Delete the original column to avoid duplication
df_categorical = df_categorical.drop(['Education'], axis=1)
In the following, the column Gender is encoded. The newly generated values are then inserted into the original dataframe and the old column is deleted.
encoder_Gender = LabelBinarizer()
# Application of the LabelBinarizer
Gender_encoded = encoder_Gender.fit_transform(df_categorical.Gender.values.reshape(-1,1))
# Insertion of the coded values into the original data set
df_categorical['Gender_encoded'] = Gender_encoded
# Delete the original column to avoid duplication
df_categorical = df_categorical.drop(['Gender'], axis=1)
4.3 Create final Dataframe
df_final = pd.concat([df_numeric, df_categorical], axis=1)
df_final.head()
4.4 Regression Analysis (Logit)
If we work with the statsmodels api (sm), we have to add a constant to the predictor(s) manually. With the statsmodels formula api this would not be necessary, but the disadvantage of that variant is that we would have to list the predictors individually in the formula.
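For comparison, a minimal sketch of the formula variant (not used for the analysis below; only two of the predictors are written out, and column names containing spaces have to be wrapped in Q()):
import statsmodels.formula.api as smf

logit_formula = smf.logit('Engaged ~ Education_encoded + Q("Total Claim Amount")', data=df_final)
logit_formula_fit = logit_formula.fit()
logit_formula_fit.summary()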
x = sm.add_constant(df_final.drop('Engaged', axis=1))
y = df_final['Engaged']
logit = sm.Logit(y,x)
logit_fit = logit.fit()
logit_fit.summary()
Two variables are significant (Education_encoded and Total Claim Amount), both with a positive relationship to the target variable Engaged.
For Education_encoded this means: the higher the education, the more receptive the customer is to marketing calls.
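To make the coefficients easier to interpret, they can also be converted into odds ratios (a small sketch; an odds ratio above 1 indicates a positive relationship with Engaged):
# Exponentiate the logit coefficients to obtain odds ratios
odds_ratios = pd.DataFrame({'odds_ratio': np.exp(logit_fit.params),
                            'p_value': logit_fit.pvalues})
odds_ratios.sort_values(by='odds_ratio', ascending=False)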
5 Predicting the Likelihood of Marketing Engagement
Here we can again use the previously created data set (df_final). Note at this point: we did not include all categorical variables, because encoding every one of them correctly would have gone beyond the scope of this example.
# Replacement of all whitespaces within the column names
df_final.columns = [x.replace(' ', '_') for x in df_final.columns]
df_final
x = df_final.drop(['Engaged'], axis=1)
y = df_final['Engaged']
trainX, testX, trainY, testY = train_test_split(x, y, test_size = 0.2)
5.1 Fit the Model
rf_model = RandomForestClassifier(n_estimators=200, max_depth=5)
rf_model.fit(trainX, trainY)
5.2 Feature Importance
feat_imps = pd.DataFrame({'importance': rf_model.feature_importances_}, index=trainX.columns)
feat_imps.sort_values(by='importance', ascending=False, inplace=True)
feat_imps
feat_imps.plot(kind='bar', figsize=(10,7))
plt.legend()
plt.show()
5.3 Model Evaluation
Accuracy
rf_preds_train = rf_model.predict(trainX)
rf_preds_test = rf_model.predict(testX)
print('Random Forest Classifier:\n> Accuracy on training data = {:.4f}\n> Accuracy on test data = {:.4f}'.format(
accuracy_score(trainY, rf_preds_train),
accuracy_score(testY, rf_preds_test)
))
ROC & AUC
rf_preds_train = rf_model.predict_proba(trainX)[:,1]
rf_preds_test = rf_model.predict_proba(testX)[:,1]
train_fpr, train_tpr, train_thresholds = roc_curve(trainY, rf_preds_train)
test_fpr, test_tpr, test_thresholds = roc_curve(testY, rf_preds_test)
train_roc_auc = auc(train_fpr, train_tpr)
test_roc_auc = auc(test_fpr, test_tpr)
print('Train AUC: %0.4f' % train_roc_auc)
print('Test AUC: %0.4f' % test_roc_auc)
plt.figure(figsize=(10,7))
plt.plot(test_fpr, test_tpr, color='darkorange', label='Test ROC curve (area = %0.4f)' % test_roc_auc)
plt.plot(train_fpr, train_tpr, color='navy', label='Train ROC curve (area = %0.4f)' % train_roc_auc)
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--')
plt.grid()
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('RandomForest Model ROC Curve')
plt.legend(loc="lower right")
plt.show()
6 Engagement to Conversion
Now that we have examined the conversion rate by means of descriptive statistics, identified the drivers of marketing engagement and can also predict engagement with a machine learning model, it is time to extract further insights from the data, such as identifying the target group behind the conversions.
df = pd.read_csv('bank-additional-full.csv', sep=';')
df.head()
num_col = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_columns = list(df.select_dtypes(include=num_col).columns)
df_numeric = df[numerical_columns]
df_numeric = df_numeric[['age', 'campaign']]
df_numeric.dtypes
obj_col = ['object']
object_columns = list(df.select_dtypes(include=obj_col).columns)
df_categorical = df[object_columns]
df_categorical = df_categorical[['job', 'marital', 'y']]
df_categorical.dtypes
dummy_job = pd.get_dummies(df_categorical['job'], prefix="job")
column_name = df_categorical.columns.values.tolist()
column_name.remove('job')
df_categorical = df_categorical[column_name].join(dummy_job)
dummy_marital = pd.get_dummies(df_categorical['marital'], prefix="marital")
column_name = df_categorical.columns.values.tolist()
column_name.remove('marital')
df_categorical = df_categorical[column_name].join(dummy_marital)
df_categorical.head()
df_final = pd.concat([df_categorical, df_numeric], axis=1)
df_final.head()
x = df_final.drop(['y'], axis=1)
y = df_final['y']
clf_dt = DecisionTreeClassifier()
clf_dt.fit(x, y)
features = x.columns.tolist()
classes = clf_dt.classes_.tolist()  # use the classifier's own class order so the labels match
plt.figure(figsize=(15, 15))
plot_tree(clf_dt, feature_names=features, class_names=classes, filled=True)
plt.savefig('tree.png')
plt.show()
This is not yet really readable or interpretable.
clf = DecisionTreeClassifier(max_depth=4)
clf.fit(x, y)
features = x.columns.tolist()
classes = clf.classes_.tolist()  # use the classifier's own class order so the labels match
plt.figure(figsize=(150, 150))
plot_tree(clf, feature_names=features, class_names=classes, filled=True)
plt.savefig('tree2.png')
plt.show()
Already much better. I personally always save the generated chart separately to be able to view the results in more detail if necessary.
Those customers that belong to the eleventh leaf node from the left are those with a value of 0 for the job_self-employed dummy, an age greater than 75.5 and a value for the campaign variable of less than 3.5.
In other words: those who are not self-employed, are older than 75.5 and have been contacted at most three times during the campaign have a high chance of converting.
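If reading the rules off the plot remains cumbersome, the decision rules can also be printed as plain text, for example with export_text (a small sketch, not part of the workflow above):
from sklearn.tree import export_text

# Print the decision rules of the pruned tree as text
rules = export_text(clf, feature_names=features)
print(rules)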
7 Conclusion
The following points were covered in the main chapters 3-6:
- Descriptive analysis of the conversion rate.
- Determination of the drivers behind marketing engagement.
- Prediction of the likelihood of marketing engagement.
- Determination and analysis of the target group that generates conversions.
References
The content of the entire post was created using the following sources:
Hwang, Y. (2019). Hands-On Data Science for Marketing: Improve your marketing strategies with machine learning using Python and R. Packt Publishing Ltd.