
NLP - Text Vectorization

1 Introduction

Now that we have cleaned up and prepared our text dataset in the previous posts, we come to the next topic: Text Vectorization.

Most machine learning algorithms cannot handle string variables. We have to convert them into a format that is readable for machine learning algorithms. Text vectorization is the process of converting text into real numbers. These numbers can be used as input to machine learning models.

In the following, I will use a simple example to show several ways in which vectorization can be done.

Finally, I will apply a vectorization method to the dataset (‘Amazon_Unlocked_Mobile_small_pre_processed.csv’) created and processed in the last post and train a machine learning model on it.

2 Import the Libraries and the Data

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
df = pd.DataFrame({'Rating': [2,5,3],
                   'Text': ["This is a brown horse",
                            "This horse likes to play",
                            "The horse is in the stable"]})
df

3 Text Vectorization

3.1 Bag-of-Words (BoW)

3.1.1 Explanation

CountVectorizer() is one of the simplest methods of text vectorization.

It creates a sparse document-term matrix: the vectorizer learns the vocabulary of the corpus and builds a matrix in which each cell indicates how often a given word occurs in a given document. This is also called term frequency, and one column is dedicated to each word in the corpus.

3.1.2 Functionality

cv = CountVectorizer()

cv_vectorizer = cv.fit(df['Text'])
text_cv_vectorized = cv_vectorizer.transform(df['Text'])

text_cv_vectorized_array = text_cv_vectorized.toarray()

print(text_cv_vectorized_array)
print()
print(text_cv_vectorized_array.shape)

10 different words were found in the text corpus. These can also be output as follows:

cv_vectorizer.get_feature_names_out()

cv_vectorizer.vocabulary_

To make the output a bit more readable, we can display it as a dataframe:

cv_vectorized_matrix = pd.DataFrame(text_cv_vectorized.toarray(), 
                                    columns=cv_vectorizer.get_feature_names_out())
cv_vectorized_matrix

How are the rows and columns in the matrix shown above to be read?

  • The rows indicate the documents in the corpus and
  • The columns indicate the tokens in the dictionary

3.1.3 Creation of the final Data Set

Finally, I create a new data set on which to train machine learning algorithms. This time I use the generated array directly to create the final data frame:

cv_df = pd.DataFrame(text_cv_vectorized_array, 
                     columns = cv_vectorizer.get_feature_names_out()).add_prefix('Counts_')

df_new_cv = pd.concat([df, cv_df], axis=1, sort=False)
df_new_cv

However, the bag-of-words method also has two crucial disadvantages:

  • BoW does not preserve the order of words (see the short sketch below) and
  • it captures no information about word meaning or context, which limits its usefulness for many downstream NLP tasks
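A minimal sketch of the first point: two sentences that consist of the same words in a different order receive identical BoW vectors.

from sklearn.feature_extraction.text import CountVectorizer

# Same words, opposite meaning
demo = ["the dog bites the man",
        "the man bites the dog"]

cv_demo = CountVectorizer()
print(cv_demo.fit_transform(demo).toarray())
# Both rows are identical -- the word order is lost:
# [[1 1 1 2]
#  [1 1 1 2]]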

3.1.4 Test of a Sample Record

Let’s test a sample record:

new_input = ["Hi this is Mikel."]
new_input

new_input_cv_vectorized = cv_vectorizer.transform(new_input)
new_input_cv_vectorized_array = new_input_cv_vectorized.toarray()
new_input_cv_vectorized_array

new_input_matrix = pd.DataFrame(new_input_cv_vectorized_array, 
                                columns = cv_vectorizer.get_feature_names_out())

new_input_matrix

The words ‘is’ and ‘this’ have been learned by the CountVectorizer and thus get a count here.

new_input = ["You say goodbye and I say hello", "hello world"]
new_input

new_input_cv_vectorized = cv_vectorizer.transform(new_input)
new_input_cv_vectorized_array = new_input_cv_vectorized.toarray()
new_input_cv_vectorized_array

new_input_matrix = pd.DataFrame(new_input_cv_vectorized_array, 
                                columns = cv_vectorizer.get_feature_names_out())

new_input_matrix

In this second example, none of the words the CountVectorizer has learned occur. Therefore, all values are 0.

3.2 N-grams

3.2.1 Explanation

First of all, what are n-grams? In a nutshell: an N-gram means a sequence of N words. So for example, “Hi there” is a 2-gram (a bigram), “Hello sunny world” is a 3-gram (trigram) and “Hi this is Mikel” is a 4-gram.

How would this look when vectorizing a text corpus?

Example: “A horse rides on the beach.”

  • Unigram (1-gram): A, horse, rides, on, the, beach
  • Bigram (2-gram): A horse, horse rides, rides on, …
  • Trigram (3-gram): A horse rides, horse rides on, …

Unlike BoW, n-grams preserve word order. They can also be created with the CountVectorizer() function; only the ngram_range parameter must be adjusted.

An ngram_range of:

  • (1, 1) means only unigrams
  • (1, 2) means unigrams and bigrams
  • (2, 2) means only bigrams
  • (1, 3) means unigrams, bigrams and trigrams …

Here is a short example:

example_sentence = ["A horse rides on the beach."]
example_sentence

cv_ngram = CountVectorizer(ngram_range=(1, 3))

cv_ngram_vectorizer = cv_ngram.fit(example_sentence)
cv_ngram_vectorizer.get_feature_names_out()

cv_ngram = CountVectorizer(ngram_range=(2, 3))

cv_ngram_vectorizer = cv_ngram.fit(example_sentence)
cv_ngram_vectorizer.get_feature_names_out()

3.2.2 Functionality

3.2.2.1 Defining ngram_range

Now that we know how the CountVectorizer works with the ngram_range parameter, we will apply it to our sample dataset:

cv_ngram = CountVectorizer(ngram_range=(1, 3))

cv_ngram_vectorizer = cv_ngram.fit(df['Text'])
text_cv_ngram_vectorized = cv_ngram_vectorizer.transform(df['Text'])

text_cv_ngram_vectorized_array = text_cv_ngram_vectorized.toarray()

print(cv_ngram_vectorizer.get_feature_names_out())

The disadvantage of n-grams is that they usually generate a very large number of features and are therefore computationally expensive. One way to counteract this is to limit the maximum number of features with the max_features parameter.

3.2.2.2 Defining max_features

cv_ngram = CountVectorizer(ngram_range=(1, 3),
                           max_features=15)

cv_ngram_vectorizer = cv_ngram.fit(df['Text'])
text_cv_ngram_vectorized = cv_ngram_vectorizer.transform(df['Text'])

text_cv_ngram_vectorized_array = text_cv_ngram_vectorized.toarray()

print(cv_ngram_vectorizer.get_feature_names_out())

cv_ngram_vectorized_matrix = pd.DataFrame(text_cv_ngram_vectorized.toarray(), 
                                          columns=cv_ngram_vectorizer.get_feature_names_out())
cv_ngram_vectorized_matrix

That worked out well. One thing I always try to avoid, however, is spaces in column names. This can easily be corrected:

# Replace the spaces in the column names with underscores
cv_ngram_vectorized_matrix.columns = [col.replace(' ', '_') for col in
                                      cv_ngram_vectorized_matrix.columns]
cv_ngram_vectorized_matrix

3.2.3 Creation of the final Data Set

cv_ngram_df = pd.DataFrame(text_cv_ngram_vectorized_array, 
                           columns = cv_ngram_vectorizer.get_feature_names_out()).add_prefix('Counts_')

df_new_cv_ngram = pd.concat([df, cv_ngram_df], axis=1, sort=False)
df_new_cv_ngram.T

3.3 TF-IDF

3.3.1 Explanation

TF-IDF stands for Term Frequency - Inverse Document Frequency. It's a statistical measure of how relevant a word is to a document in a collection of documents.

TF-IDF consists of two components:

  • Term frequency (TF): The number of times the word occurs in the document
  • Inverse Document Frequency (IDF): A weighting that indicates how common or rare a word is in the overall document set.

Multiplying TF and IDF results in the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document.

3.3.1.1 Mathematical Formulas

TF-IDF is therefore the product of TF and IDF:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \cdot \text{IDF}(t)$$

where TF computes the term frequency:

$$\text{TF}(t, d) = \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d}$$

and IDF computes the inverse document frequency:

$$\text{IDF}(t) = \log\left(\frac{N}{\text{df}(t)}\right)$$

where $N$ is the total number of documents in the collection $D$ and $\text{df}(t)$ is the number of documents in which the term $t$ occurs.

3.3.1.2 Example Calculation

Here is a simple example. Let’s assume we have the following collection of documents D:

  • Doc1: “I said please and you said thanks”
  • Doc2: “please darling please”
  • Doc3: “please thanks”

The calculation of TF, IDF and TF-IDF is shown in the table below:

| Word    | TF (Doc1) | TF (Doc2) | TF (Doc3) | IDF      | TF-IDF (Doc1)  | TF-IDF (Doc2) | TF-IDF (Doc3)  |
|---------|-----------|-----------|-----------|----------|----------------|---------------|----------------|
| i       | 1/7       | 0         | 0         | log(3/1) | 1/7 · log(3)   | 0             | 0              |
| said    | 2/7       | 0         | 0         | log(3/1) | 2/7 · log(3)   | 0             | 0              |
| please  | 1/7       | 2/3       | 1/2       | log(3/3) | 0              | 0             | 0              |
| and     | 1/7       | 0         | 0         | log(3/1) | 1/7 · log(3)   | 0             | 0              |
| you     | 1/7       | 0         | 0         | log(3/1) | 1/7 · log(3)   | 0             | 0              |
| thanks  | 1/7       | 0         | 1/2       | log(3/2) | 1/7 · log(3/2) | 0             | 1/2 · log(3/2) |
| darling | 0         | 1/3       | 0         | log(3/1) | 0              | 1/3 · log(3)  | 0              |

Let’s take a closer look at the values to understand the calculation. The word ‘said’ appears twice in the first document, and the total number of words in Doc1 is 7.

The TF value is therefore 2/7.

In the other two documents ‘said’ is not present at all. This results in an IDF value of log(3/1), since there are a total of three documents in the collection and the word ‘said’ appears in one of the three.

The calculation of the TF-IDF value for 'said' in Doc1 is therefore as follows (using the natural logarithm):

$$\text{TF-IDF}(\text{said}, \text{Doc1}) = \frac{2}{7} \cdot \log\left(\frac{3}{1}\right) \approx 0.31$$

If you look at the values for ‘please’, you will see that this word appears (sometimes several times) in all documents. It is therefore considered common and receives a TF-IDF value of 0.
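To make the arithmetic concrete, here is a minimal sketch that reproduces these values with the original formulas (I use the natural logarithm here; a different log base would only rescale the scores):

import math

# Our document collection, tokenized into lower-case words
docs = ["i said please and you said thanks".split(),
        "please darling please".split(),
        "please thanks".split()]

N = len(docs)

def tf(term, doc):
    # term frequency: occurrences of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term):
    # inverse document frequency: log(N / number of documents containing the term)
    df = sum(term in doc for doc in docs)
    return math.log(N / df)

print(tf('said', docs[0]) * idf('said'))      # 2/7 * log(3/1) ≈ 0.3139
print(tf('please', docs[1]) * idf('please'))  # 2/3 * log(3/3) = 0.0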

3.3.1.3 TF-IDF using scikit-learn

Below I will use the TF-IDF vectorizer from scikit-learn, which makes two small modifications to the original formula.

The calculation of IDF is as follows:

$$\text{IDF}(t) = \ln\left(\frac{1 + N}{1 + \text{df}(t)}\right)$$

Here, 1 is added to the numerator and to the denominator. Adding 1 to the denominator avoids a division by 0 (which could otherwise occur for a vocabulary term that appears in none of the documents being transformed), and the 1 in the numerator balances the effect of that adjustment.

The second modification is in the calculation of the TF-IDF values:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \cdot (\text{IDF}(t) + 1)$$

Here, a 1 is added to the IDF so that a zero IDF value does not completely suppress the TF-IDF score. Using the TfidfVectorizer() function on our sample collection clearly shows this effect:

documents = ['I said please and you said thanks',
             'please darling please',
             'please thanks']

tf_idf = TfidfVectorizer()
tf_idf_vectorizer = tf_idf.fit(documents)
documents_tf_idf_vectorized = tf_idf_vectorizer.transform(documents)

documents_tf_idf_vectorized_array = documents_tf_idf_vectorized.toarray()


tf_idf_vectorized_matrix = pd.DataFrame(documents_tf_idf_vectorized.toarray(), 
                                        columns=tf_idf_vectorizer.get_feature_names_out())
tf_idf_vectorized_matrix = tf_idf_vectorized_matrix[['said', 'please', 'and', 'you', 'thanks', 'darling']]
tf_idf_vectorized_matrix.T

Here again, for comparison, the values calculated using the original formula (see the table in section 3.3.1.2): there, the word 'please' received a TF-IDF score of 0 in every document.

As we can see very nicely, the values for the word 'please' were not completely suppressed in the calculation by TfidfVectorizer().

However, the interpretation of TF-IDF remains exactly the same despite these minor adjustments. Note that TfidfVectorizer() additionally normalizes each document vector to unit length by default (norm='l2'), which is why its values do not exactly match a hand calculation.

Furthermore, the word 'I' was not included: scikit-learn's vectorizers automatically disregard words with a length of one character, because the default token_pattern only matches tokens with at least two word characters. This default can be overridden, as the sketch below shows.
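A minimal sketch of this, loosening the default token_pattern r'(?u)\b\w\w+\b' so that single characters are also matched:

from sklearn.feature_extraction.text import TfidfVectorizer

# r'(?u)\b\w+\b' also matches tokens of length 1
tf_idf_single = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
tf_idf_single.fit(['I said please and you said thanks'])
print(tf_idf_single.get_feature_names_out())
# ['and' 'i' 'please' 'said' 'thanks' 'you']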

Hint:

Scikit-learn also provides the TfidfTransformer() class. However, it needs the count matrix produced by CountVectorizer as input to calculate the TF-IDF values, see here. In almost all cases you can use TfidfVectorizer directly.
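A minimal sketch of this equivalence (assuming default parameters for both classes):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

documents = ['I said please and you said thanks',
             'please darling please',
             'please thanks']

# First create raw term counts, then re-weight them with TF-IDF
counts = CountVectorizer().fit_transform(documents)
tf_idf_via_transformer = TfidfTransformer().fit_transform(counts)

# This produces the same matrix as TfidfVectorizer().fit_transform(documents)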

3.3.2 Functionality

tf_idf = TfidfVectorizer()
tf_idf_vectorizer = tf_idf.fit(df['Text'])
text_tf_idf_vectorized = tf_idf_vectorizer.transform(df['Text'])

text_tf_idf_vectorized_array = text_tf_idf_vectorized.toarray()


tf_idf_vectorized_matrix = pd.DataFrame(text_tf_idf_vectorized.toarray(), 
                                        columns=tf_idf_vectorizer.get_feature_names_out())

tf_idf_vectorized_matrix

3.3.3 Creation of the final Data Set

tf_idf_df = pd.DataFrame(text_tf_idf_vectorized_array, 
                         columns = tf_idf_vectorizer.get_feature_names_out()).add_prefix('TF-IDF_')

df_new_tf_idf = pd.concat([df, tf_idf_df], axis=1, sort=False)
df_new_tf_idf.T

4 Best Practice - Application to the Amazon Data Set

As mentioned in the introduction, I will now apply a vectorizer to the dataset Amazon_Unlocked_Mobile_small_pre_processed.csv that I prepared in the last post. Afterwards, I will train a machine learning model on it.

Feel free to download the dataset from my GitHub Repository.

4.1 Import the Dataframe

url = "https://raw.githubusercontent.com/MFuchs1989/Datasets-and-Miscellaneous/main/datasets/NLP/Text%20Pre-Processing%20-%20All%20in%20One/Amazon_Unlocked_Mobile_small_pre_processed.csv"

df_amazon = pd.read_csv(url, on_bad_lines='skip') # in older pandas versions: error_bad_lines=False
# Conversion of the desired column to the correct data type
df_amazon['Reviews_cleaned_wo_rare_words'] = df_amazon['Reviews_cleaned_wo_rare_words'].astype('str')
df_amazon.head(3).T

I have already prepared the data set in various ways. You can read about the exact steps here: NLP - Text Pre-Processing - All in One

I will apply the TF-IDF vectorizer to the ‘Reviews_cleaned_wo_rare_words’ column. For this I will create a subset of the original dataframe. Feel free to try the TF-IDF (or any other vectorizer) on the other processed columns and compare the performance of the ML algorithms.

df_amazon_subset = df_amazon[['Label', 'Reviews_cleaned_wo_rare_words']]
df_amazon_subset

x = df_amazon_subset.drop(['Label'], axis=1)
y = df_amazon_subset['Label']

trainX, testX, trainY, testY = train_test_split(x, y, test_size = 0.2)

4.2 TF-IDF Vectorizer

As with scaling or encoding, the .fit command is applied only to the training part. With the vocabulary and IDF weights learned there, trainX as well as testX is then vectorized.

In the code below I additionally use .values.astype('U'). Strictly speaking, this would not have been necessary at this point, because I already assigned the correct data type to the column 'Reviews_cleaned_wo_rare_words' when loading the dataset. As a safeguard that TfidfVectorizer always receives Unicode strings, however, this code part can be kept.

tf_idf = TfidfVectorizer()
tf_idf_vectorizer = tf_idf.fit(trainX['Reviews_cleaned_wo_rare_words'].values.astype('U'))

trainX_tf_idf_vectorized = tf_idf_vectorizer.transform(trainX['Reviews_cleaned_wo_rare_words'].values.astype('U'))
testX_tf_idf_vectorized = tf_idf_vectorizer.transform(testX['Reviews_cleaned_wo_rare_words'].values.astype('U'))


trainX_tf_idf_vectorized_array = trainX_tf_idf_vectorized.toarray()
testX_tf_idf_vectorized_array = testX_tf_idf_vectorized.toarray()

print('Number of features generated: ' + str(len(tf_idf_vectorizer.get_feature_names_out())))

The next step is actually not necessary, since the machine learning models can also handle the arrays (or even the sparse matrices) directly.

trainX_tf_idf_vectorized_final = pd.DataFrame(trainX_tf_idf_vectorized_array, 
                                              columns = tf_idf_vectorizer.get_feature_names_out()).add_prefix('TF-IDF_')

testX_tf_idf_vectorized_final = pd.DataFrame(testX_tf_idf_vectorized_array, 
                                             columns = tf_idf_vectorizer.get_feature_names_out()).add_prefix('TF-IDF_')

4.3 Model Training

In the following I will use the Support Vector Machine classifier. Of course you can also try any other one.

clf = SVC(kernel='linear')
clf.fit(trainX_tf_idf_vectorized_final, trainY)

y_pred = clf.predict(testX_tf_idf_vectorized_final)
cm = confusion_matrix(testY, y_pred) # avoid shadowing the imported confusion_matrix function
print(cm)

print('Accuracy: {:.2f}'.format(accuracy_score(testY, y_pred)))

4.4 TF-IDF Vectorizer with ngram_range

The TF-IDF vectorizer can also be used in combination with n-grams. It has been shown in practice that using the parameter analyzer='char' in combination with ngram_range not only generates fewer features (which is less computationally intensive) but also often yields better results.

tf_idf_ngram = TfidfVectorizer(analyzer='char',
                               ngram_range=(2, 3))

tf_idf_ngram_vectorizer = tf_idf_ngram.fit(trainX['Reviews_cleaned_wo_rare_words'].values.astype('U'))

trainX_tf_idf_ngram_vectorized = tf_idf_ngram_vectorizer.transform(trainX['Reviews_cleaned_wo_rare_words'].values.astype('U'))
testX_tf_idf_ngram_vectorized = tf_idf_ngram_vectorizer.transform(testX['Reviews_cleaned_wo_rare_words'].values.astype('U'))


trainX_tf_idf_ngram_vectorized_array = trainX_tf_idf_ngram_vectorized.toarray()
testX_tf_idf_ngram_vectorized_array = testX_tf_idf_ngram_vectorized.toarray()

print('Number of features generated: ' + str(len(tf_idf_ngram_vectorizer.get_feature_names_out())))

trainX_tf_idf_ngram_vectorized_final = pd.DataFrame(trainX_tf_idf_ngram_vectorized_array, 
                                                    columns = tf_idf_ngram_vectorizer.get_feature_names_out()).add_prefix('TF-IDF_ngram_')

testX_tf_idf_ngram_vectorized_final = pd.DataFrame(testX_tf_idf_ngram_vectorized_array, 
                                                   columns = tf_idf_ngram_vectorizer.get_feature_names_out()).add_prefix('TF-IDF_ngram_')

4.5 Model Training II

clf2 = SVC(kernel='linear')
clf2.fit(trainX_tf_idf_ngram_vectorized_final, trainY)

y_pred2 = clf2.predict(testX_tf_idf_ngram_vectorized_final)
cm2 = confusion_matrix(testY, y_pred2)
print(cm2)

print('Accuracy: {:.2f}'.format(accuracy_score(testY, y_pred2)))

Unfortunately, the performance did not increase. Therefore, we stick with the first TF-IDF vectorizer and the first ML model.

4.6 Out-Of-The-Box-Data

Finally, I’d like to test some self-generated evaluation comments and see what the model predicts.

Normally, all pre-processing steps that took place during model training should also be applied to new data. These were (described in detail in the post NLP - Text Pre-Processing - All in One):

  • Text Cleaning
    • Conversion to Lower Case
    • Removing HTML-Tags
    • Removing URLs
    • Removing Accented Characters
    • Removing Punctuation
    • Removing irrelevant Characters (Numbers and Punctuation)
    • Removing extra Whitespaces
  • Tokenization
  • Removing Stop Words
  • Normalization
  • Removing Single Characters
  • Removing specific Words
  • Removing Rare words

For simplicity, I'll omit these steps for this example, since I use simple words without punctuation or special characters. (A minimal sketch of what such a cleaning step could look like follows below.)

my_rating_comment = ["a great device anytime again", 
                     "has poor reception and a too small display", 
                     "goes like this to some extent has a lot of good but also negative"]

Here is the vectorized data set:

my_rating_comment_vectorized = tf_idf_vectorizer.transform(my_rating_comment)
my_rating_comment_vectorized_array = my_rating_comment_vectorized.toarray()
my_rating_comment_df = pd.DataFrame(my_rating_comment_vectorized_array, 
                                    columns = tf_idf_vectorizer.get_feature_names_out())
my_rating_comment_df

To see which words from my_rating_comment were in the learned vocabulary of the vectorizer (and consequently received a TF-IDF score), I filter the dataset:

my_rating_comment_df_filtered = my_rating_comment_df.loc[:, (my_rating_comment_df != 0).any(axis=0)]
my_rating_comment_df_filtered

OK, let's predict:

y_pred_my_rating = clf.predict(my_rating_comment_df)

Here is the final result:

my_rating_comment_df_final = pd.DataFrame(my_rating_comment, columns = ['My_Rating'])
my_rating_comment_df_final['Prediction'] = y_pred_my_rating
my_rating_comment_df_final

5 Conclusion

In this post I showed how to generate readable input from text data for machine learning algorithms. Furthermore, I applied a vectorizer to the previously created and cleaned dataset and trained a machine learning model on it. Finally, I showed how to make new predictions using the trained model.