Word embedding

Unlike traditional approaches, such as BoW and TF-IDF, modern methods of text representation capture the context of the information, as well as the presence or frequency of words. One very popular and powerful approach that follows this concept is known as word embedding. Word embeddings create a dense vector of a fixed length that can store information about the context and meaning of the document.

Each word is represented by a data point in a multidimensional space, which is known as the embedding space. This embedding space has n dimensions, where each dimension corresponds to a particular position in this dense vector.

Although it may sound confusing, the concept is actually pretty simple. Suppose you have a list of four words, and you want to plot them in an embedding space of five dimensions. The words are king, queen, live, and castle. Table 4.13 shows how to do this.

         Dim 1   Dim 2   Dim 3   Dim 4   Dim 5
King      0.22    0.76    0.77    0.44    0.33
Queen     0.98    0.09    0.67    0.89    0.56
Live      0.13    0.99    0.88    0.01    0.55
Castle    0.01    0.89    0.34    0.02    0.90

Table 4.13 – An embedding space representation

Forget the hypothetical numbers in Table 4.13 and focus on the data structure; you will see that each word is now represented by n dimensions in the embedding space. This process of transforming words into vectors can be performed by many different methods, but the most popular ones are word2vec and GloVe.
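
For instance, here is a minimal sketch of how you could train word2vec embeddings with the gensim library (assuming gensim 4.x parameter names and a tiny toy corpus invented purely for illustration):

    from gensim.models import Word2Vec

    # Toy corpus invented for illustration: each document is a list of tokens
    corpus = [
        ["the", "king", "lives", "in", "the", "castle"],
        ["the", "queen", "lives", "in", "the", "castle"],
    ]

    # vector_size fixes the length of the dense vector learned for every word;
    # window is the context window used during training
    model = Word2Vec(corpus, vector_size=5, window=2, min_count=1, seed=42)

    print(model.wv["king"])                      # a dense vector with five dimensions
    print(model.wv.similarity("king", "queen"))  # cosine similarity between the two vectors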

Once you have each word represented as a vector of a fixed length, you can apply many other techniques to do whatever you need. One very common task is plotting those “words” (actually, their vectors) in that embedding space and visually checking how close they are to each other!

Technically, you can’t plot them as-is, since human brains cannot interpret more than three dimensions. Instead, you usually apply a dimensionality reduction technique (such as principal component analysis, which you will learn about later) to reduce the number of dimensions to two, and finally plot the words on a Cartesian plane. That’s why you might have seen pictures like the one in Figure 4.12. Have you ever asked yourself how it is possible to plot words on a graph?

Figure 4.12 – Plotting words
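
To make this concrete, here is a minimal sketch of that dimensionality reduction step, assuming scikit-learn and matplotlib are available and reusing the hypothetical numbers from Table 4.13:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Hypothetical five-dimensional embeddings taken from Table 4.13
    embeddings = {
        "king":   [0.22, 0.76, 0.77, 0.44, 0.33],
        "queen":  [0.98, 0.09, 0.67, 0.89, 0.56],
        "live":   [0.13, 0.99, 0.88, 0.01, 0.55],
        "castle": [0.01, 0.89, 0.34, 0.02, 0.90],
    }

    words = list(embeddings)
    vectors = np.array([embeddings[w] for w in words])

    # Reduce the five dimensions down to two so the words fit on a Cartesian plane
    coords = PCA(n_components=2).fit_transform(vectors)

    plt.scatter(coords[:, 0], coords[:, 1])
    for word, (x, y) in zip(words, coords):
        plt.annotate(word, (x, y))
    plt.show()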

Next, you will learn how the five dimensions shown in Table 4.13 were built. Again, there are different methods to do this, but you will learn the most popular one, which uses a co-occurrence matrix with a fixed context window.

First, you have to come up with some logic to represent each word, keeping in mind that you also have to take its context into consideration. To solve the context requirement, you need to define a fixed context window, which specifies how many words will be grouped together for context learning. For instance, assume this fixed context window is set to 2.

Next, you will create a co-occurrence matrix, which will count the number of occurrences of each pair of words, according to the pre-defined context window. Consider the following text: “I will pass this exam, you will see. I will pass it.”

The context window of the first occurrence of the word “pass” would be the two words before it and the two words after it: “[I will] pass [this exam], you will see. I will pass it.” Considering this logic, have a look at how many times each pair of words appears in the context window (Figure 4.13).

Figure 4.13 – Co-occurrence matrix

As you can see, the pair of words “I will” appears three times when a context window of size 2 is used (each co-occurrence is marked in brackets):

  1. [I will] pass this exam, you will see. I will pass it.
  2. I will pass this exam, you [will] see. [I] will pass it.
  3. I will pass this exam, you will see. [I will] pass it.

Looking at Figure 4.13, you should apply the same logic to all other pairs of words, replacing “…” with the associated number of occurrences. You now have a numerical representation for each word!
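
If you want to verify these counts yourself, the following sketch builds the co-occurrence matrix for the example sentence with a context window of 2 (plain Python, no external libraries; the tokenization simply lowercases the text and drops punctuation):

    from collections import defaultdict

    # Example sentence from this section, lowercased and with punctuation removed
    tokens = "i will pass this exam you will see i will pass it".split()

    window = 2  # the fixed context window used in the example
    cooc = defaultdict(int)

    # Count each unordered pair of words that falls inside the context window
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            pair = tuple(sorted((word, tokens[j])))
            cooc[pair] += 1

    print(cooc[("i", "will")])  # 3 -- the three occurrences listed above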