Bag of words – Data Preparation and Transformation – MLS-C01 Study Guide

Bag of words

The first one you will learn is known as bag of words (BoW). This is a very common and simple technique, applied to text data, that creates matrix representations to describe the number of words within the text. BoW consists of two main steps: creating a vocabulary and creating a representation of the presence of those known words from the vocabulary in the text. These steps can be seen in Figure 4.11.

Figure 4.11 – BoW in action

First things first, you usually can’t use raw text to prepare a BoW representation. There is a data cleansing step where you lowercase the text; split it into word tokens; remove punctuation, non-alphabetical tokens, and stop words; and, whenever necessary, apply any other custom cleansing techniques you may want.
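The cleansing step can be sketched as follows. This is a minimal illustration, not a production pipeline: the `cleanse` function name and the small stop word set are assumptions made for this example (real projects typically use a fuller stop word list, such as the one shipped with NLTK).

```python
import re

# Illustrative stop word set (an assumption for this sketch; real lists are longer)
STOP_WORDS = {"is", "a", "an", "the", "of", "and", "it"}

def cleanse(text):
    text = text.lower()                   # lowercase everything
    tokens = re.findall(r"[a-z]+", text)  # keep alphabetical tokens only,
                                          # dropping punctuation and digits
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words

print(cleanse("This movie is really good, although it is an old production!"))
# ['this', 'movie', 'really', 'good', 'although', 'old', 'production']
```

Note how the cleansed output matches the seven tokens used in Figure 4.11.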

Once you have cleansed your raw text, you can add each word to a global vocabulary. Technically, this is usually a dictionary that maps each word to its number of occurrences – for example, {(apple, 10), (watermelon, 20)}. As I mentioned previously, this is a global dictionary, so you should take all the texts you are analyzing into account.
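Building that global vocabulary can be sketched with Python's `collections.Counter`; the `build_vocabulary` function name is an assumption for this example.

```python
from collections import Counter

def build_vocabulary(cleansed_texts):
    """Accumulate word occurrence counts across ALL texts (a global dictionary)."""
    vocab = Counter()
    for tokens in cleansed_texts:
        vocab.update(tokens)  # counts accumulate across every text, not per text
    return dict(vocab)

print(build_vocabulary([["apple", "apple", "watermelon"], ["apple"]]))
# {'apple': 3, 'watermelon': 1}
```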

Now, with the cleansed text and updated vocabulary, you can build your text representation in the form of a matrix, where each column represents one word from the global vocabulary and each row represents a text you have analyzed. The way you represent those texts on each row may vary according to different strategies, such as binary, frequency, and count. Next, you will learn about these strategies in a little more detail.

In Figure 4.11, a single piece of text is being processed with the three different strategies for BoW. That’s why you can see three rows in that table, instead of just one (in a real scenario, you would choose just one of them for your implementation).

In the first row, a binary strategy was used, which assigns 1 if a word from the vocabulary is present in the text and 0 if not. Because the vocabulary was built from a single text, every word in the vocabulary appears in that text (which is why you can only see 1s in the binary row).

In the second row, a frequency strategy was used, which counts the number of occurrences of each word within the text and divides it by the total number of words in the text. For example, the word “this” appears just once (1) in a text of seven words (7), so 1/7 is approximately 0.14.

Finally, in the third row, a count strategy was used, which is a simple count of the occurrences of each word within the text.
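The three strategies can be sketched side by side. The `bow_row` helper below is a hypothetical name introduced for this example; it builds one matrix row for a given text, vocabulary, and strategy.

```python
def bow_row(tokens, vocabulary, strategy):
    """Build one BoW matrix row (one column per vocabulary word)."""
    counts = {w: tokens.count(w) for w in vocabulary}
    if strategy == "binary":
        # 1 if the word occurs in the text, 0 otherwise
        return [1 if counts[w] > 0 else 0 for w in vocabulary]
    if strategy == "frequency":
        # occurrences divided by the total number of words in the text
        total = len(tokens)
        return [round(counts[w] / total, 2) for w in vocabulary]
    if strategy == "count":
        # raw number of occurrences
        return [counts[w] for w in vocabulary]
    raise ValueError(f"unknown strategy: {strategy}")

tokens = ["this", "movie", "really", "good", "although", "old", "production"]
vocab = sorted(set(tokens))  # one column per unique word
print(bow_row(tokens, vocab, "binary"))     # [1, 1, 1, 1, 1, 1, 1]
print(bow_row(tokens, vocab, "frequency"))  # each word appears once in 7: 0.14
print(bow_row(tokens, vocab, "count"))      # [1, 1, 1, 1, 1, 1, 1]
```

Because every word occurs exactly once in this text, the binary and count rows happen to be identical, and every frequency is 1/7, as in Figure 4.11.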

Important note

This note is really important – you are likely to find it in your exam. You may have noticed that the BoW matrix contains unique words in the columns and each text representation is in the rows. If you have 100 long pieces of text with only 50 unique words across them, your BoW matrix will have 50 columns and 100 rows. During your exam, you are likely to receive a list of pieces of text and be asked to prepare the BoW matrix.
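The matrix shape described in the note above can be sketched as follows; the `bow_matrix` function name is an assumption for this example. The key point is that the number of rows equals the number of texts and the number of columns equals the number of unique words across all of them.

```python
def bow_matrix(cleansed_texts):
    """Build a count-strategy BoW matrix: one row per text, one column per unique word."""
    vocabulary = sorted({w for tokens in cleansed_texts for w in tokens})
    matrix = [[tokens.count(w) for w in vocabulary] for tokens in cleansed_texts]
    return vocabulary, matrix

texts = [["apple", "apple"], ["apple", "watermelon"], ["watermelon"]]
vocab, matrix = bow_matrix(texts)
print(vocab)   # ['apple', 'watermelon']  -> 2 columns (unique words)
print(matrix)  # [[2, 0], [1, 1], [0, 1]] -> 3 rows (texts)
```

With 100 texts containing 50 unique words overall, `matrix` would have 100 rows and 50 columns, exactly as in the note.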

There is one more extremely important concept you should know about BoW, which is the n-gram configuration. The term n-gram is used to describe the way you would like to look at your vocabulary, either via single words (uni-gram), groups of two words (bi-gram), groups of three words (tri-gram), or even groups of n words (n-gram). So far, you have seen BoW representations using a uni-gram approach, but more sophisticated representations of BoW may use bi-grams, tri-grams, or n-grams.

The main logic itself does not change, but you need to know how to represent n-grams in BoW. Still using the example from Figure 4.11, a bi-gram approach would combine those words in the following way: [this movie, movie really, really good, good although, although old, old production]. Make sure you understand this before taking the exam.
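The bi-gram combination above can be sketched as a sliding window over the token list; the `ngrams` function name is an assumption for this example.

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list and join each group of words."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["this", "movie", "really", "good", "although", "old", "production"]
print(ngrams(tokens, 2))
# ['this movie', 'movie really', 'really good', 'good although',
#  'although old', 'old production']
```

Setting `n=1` reproduces the uni-gram vocabulary you have seen so far, and `n=3` would produce tri-grams such as "this movie really".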