
Important note

The power and simplicity of BoW come from the fact that you can easily come up with a training set to train your algorithms. If you look at Figure 4.11, can you see that adding more rows and a classification column to that table, such as good or bad review, would allow us to train a binary classification model to predict sentiment?
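
To make that idea concrete, here is a minimal sketch, assuming a recent version of scikit-learn is available; the reviews, labels, and resulting columns are made up for illustration and are not the actual rows from Figure 4.11:

```python
# A minimal Bag-of-Words training set: a document-term matrix plus a label column.
# The reviews and labels below are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "great product, works really well",
    "terrible quality, broke after two days",
    "good value and good support",
]
labels = [1, 0, 1]  # 1 = good review, 0 = bad review

vectorizer = CountVectorizer()                   # learns the vocabulary from the corpus
bow_matrix = vectorizer.fit_transform(reviews)   # sparse document-term matrix

print(vectorizer.get_feature_names_out())        # the learned vocabulary
print(bow_matrix.toarray())                      # word counts per review
# bow_matrix plus labels is the kind of training set a binary classifier could learn from
```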

Alright – you might have noticed that many of the awesome techniques that you have been introduced to come with some downsides. The problem with BoW is the challenge of maintaining its vocabulary. You can easily see that, in a huge corpus of texts, the vocabulary keeps growing and the matrix representations tend to be sparse (yes – the sparsity issue again).

One possible way to solve the vocabulary size issue is by using word hashing (also known in ML as the hashing trick). Hash functions map data of arbitrary size to data of a fixed size. This means you can use the hashing trick to represent each text with a fixed number of features (regardless of the vocabulary's size). Technically, this hashing space allows collisions (different words mapped to the same feature), so this is something to take into account when you are implementing feature hashing.
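
As a quick, hedged sketch of the hashing trick, scikit-learn ships a HashingVectorizer that maps tokens into a fixed number of columns; the texts and the n_features value below are arbitrary choices for illustration:

```python
# Sketch of the hashing trick: HashingVectorizer maps tokens to a fixed number of
# features, so the matrix width never grows with the vocabulary. n_features is arbitrary.
from sklearn.feature_extraction.text import HashingVectorizer

texts = [
    "the delivery was fast and the packaging was fine",
    "the product stopped working after a week",
]

hasher = HashingVectorizer(n_features=16, alternate_sign=False)
hashed = hasher.transform(texts)   # stateless: no vocabulary to fit or store

print(hashed.shape)                # (2, 16) -- fixed width, no matter how many new words appear
# Different words can hash to the same column (a collision), the trade-off mentioned above.
```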

TF-IDF

Another problem that comes with BoW, especially when you use the frequency strategy to build the feature space, is that very frequent words get inflated scores simply because they occur many times within the document. It turns out that, often, these high-occurrence words are not the key words of the document, but common words that also appear many times in several other documents.

Term Frequency-Inverse Document Frequency (TF-IDF) helps penalize these types of words by checking how frequently they appear across other documents and using that information to rescale their frequency within the document.

At the end of the process, TF-IDF tends to give more importance to words that are unique to the document (document-specific words). Next, let’s look at a concrete example so that you can understand it in depth.

Consider a document containing 100 words in which the word “Amazon” appears three times. The Term Frequency (TF) of this word would be 3/100, which is equal to 0.03. Now, suppose your corpus contains 1,000 documents and that the word “Amazon” appears in 50 of them. In this case, the Inverse Document Frequency (IDF) would be given by log(1,000/50), which is equal to 1.30. The TF-IDF score of the word “Amazon,” in that specific document under analysis, will be the product TF * IDF, which is 0.03 * 1.30 (0.039).

Suppose that, instead of 50 documents, the word “Amazon” had appeared in 750 of those 1,000 documents – in other words, much more frequently than in the prior scenario. In this case, the TF part of the equation will not change – it is still 0.03. However, the IDF piece will change, since this time it will be log(1,000/750), which is approximately 0.12, so the TF-IDF score drops to roughly 0.0037. As you can see, now the word “Amazon” has much less importance than in the previous example.
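
If you want to double-check the arithmetic, here is a short sketch that recomputes both scenarios with a base-10 logarithm. Note that this follows the plain textbook formula used in this example; libraries such as scikit-learn apply slightly different IDF smoothing, so their numbers will not match exactly:

```python
# Recompute the worked TF-IDF example above, using a base-10 logarithm.
import math

def tf_idf(term_count, doc_length, total_docs, docs_with_term):
    tf = term_count / doc_length                    # term frequency within the document
    idf = math.log10(total_docs / docs_with_term)   # inverse document frequency
    return tf, idf, tf * idf

# Scenario 1: "Amazon" appears 3 times in a 100-word document and in 50 of 1,000 documents
tf, idf, score = tf_idf(3, 100, 1000, 50)
print(f"TF={tf:.2f}, IDF={idf:.2f}, TF-IDF={score:.3f}")   # TF=0.03, IDF=1.30, TF-IDF=0.039

# Scenario 2: same document, but "Amazon" now appears in 750 of the 1,000 documents
tf, idf, score = tf_idf(3, 100, 1000, 750)
print(f"TF={tf:.2f}, IDF={idf:.2f}, TF-IDF={score:.4f}")   # TF=0.03, IDF=0.12, TF-IDF=0.0037
# The last score is computed from the unrounded IDF (about 0.1249), hence 0.0037 rather than 0.0036.
```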