Textual analysis

Modern applications use Natural Language Processing (NLP) for several purposes, such as text translation, document classification, web search, and Named Entity Recognition (NER), among others.

AWS offers a suite of algorithms for most NLP use cases. In the next few subsections, you will have a look at these built-in algorithms for textual analysis.

BlazingText algorithm

BlazingText performs two different types of tasks: text classification, which is a supervised learning approach that extends the fastText text classifier, and Word2Vec, which is an unsupervised learning algorithm.

BlazingText’s implementations of these two algorithms are optimized to run on large datasets. For example, you can train a model on billions of words in a few minutes.

This scalability aspect of BlazingText is possible due to the following:

  • Its ability to use multi-core CPUs and a single GPU to accelerate text classification
  • Its ability to use multi-core CPUs or GPUs, with custom CUDA kernels for GPU acceleration, when training with the Word2Vec algorithm

The Word2Vec option supports batch_skipgram mode, which allows BlazingText to perform distributed training across multiple CPU instances.
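To make this concrete, here is a minimal sketch of launching a Word2Vec training job in batch_skipgram mode with the SageMaker Python SDK. The S3 path, instance type, and hyperparameter values are placeholders, not recommendations:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes this runs inside SageMaker

# Retrieve the BlazingText container image for the current region
image = image_uris.retrieve("blazingtext", session.boto_region_name)

estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=2,               # batch_skipgram allows multiple CPU instances
    instance_type="ml.c5.2xlarge",
    sagemaker_session=session,
)

# mode can be "skipgram", "cbow", or "batch_skipgram";
# only batch_skipgram supports distributed training
estimator.set_hyperparameters(mode="batch_skipgram", vector_dim=100, epochs=5)

estimator.fit({"train": "s3://my-bucket/word2vec/corpus.txt"})  # placeholder path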

Important note

The distributed training that’s performed by BlazingText uses a mini-batching approach to convert level-1 BLAS (Basic Linear Algebra Subprograms) operations into level-3 BLAS operations. If you see these terms during your exam, you should know that they are related to BlazingText (Word2Vec mode).

Still in Word2Vec mode, BlazingText supports both the skip-gram and Continuous Bag of Words (CBOW) architectures.

Finally, note the following configurations of BlazingText, since they are likely to be present in your exam:

  • In Word2Vec mode, only the train channel is available.
  • BlazingText expects a single text file with space-separated tokens. Each line of the file must contain a single sentence. This means you usually have to preprocess your corpus of data before using BlazingText, as shown in the sketch after this list.
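As an illustration, the following sketch (with hypothetical file names and labels) converts a small corpus into the format BlazingText expects: one sentence per line, tokens separated by spaces, and, for the supervised text classification mode, a __label__ prefix on each line:

import re

def to_blazingtext_line(sentence, label=None):
    # Lowercase, split into space-separated tokens, and optionally
    # prepend the __label__ prefix used by supervised mode
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    line = " ".join(tokens)
    return f"__label__{label} {line}" if label is not None else line

# Hypothetical input: a list of (sentence, label) pairs
corpus = [
    ("SageMaker makes training easy.", "positive"),
    ("The job failed with an error.", "negative"),
]

with open("train.txt", "w") as f:  # placeholder output file
    for sentence, label in corpus:
        f.write(to_blazingtext_line(sentence, label) + "\n")

For Word2Vec mode, you would simply omit the label argument.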

Sequence-to-sequence algorithm

This is a supervised algorithm that transforms an input sequence into an output sequence. These sequences can be text sentences or even audio recordings.

The most common use cases for sequence-to-sequence are machine translation, text summarization, and speech-to-text. In general, any problem where an input sequence must be mapped to an output sequence can be approached with this algorithm.

Technically, AWS SageMaker’s Seq2Seq builds models using two types of neural networks: Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) with an attention mechanism.
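For orientation, here is a minimal sketch of launching a Seq2Seq training job with the SageMaker Python SDK. The S3 paths and hyperparameter values are placeholders, and in practice the data must first be tokenized, mapped to integer IDs, and converted to the recordio-protobuf format the algorithm expects:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Retrieve the built-in Seq2Seq container for the current region
image = image_uris.retrieve("seq2seq", session.boto_region_name)

estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # Seq2Seq requires GPU instances
    sagemaker_session=session,
)

# Illustrative hyperparameter values only
estimator.set_hyperparameters(num_layers_encoder=1, num_layers_decoder=1)

# Seq2Seq expects train, validation, and vocab channels
estimator.fit({
    "train": "s3://my-bucket/seq2seq/train",
    "validation": "s3://my-bucket/seq2seq/validation",
    "vocab": "s3://my-bucket/seq2seq/vocab",
})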

Latent Dirichlet Allocation algorithm

Latent Dirichlet Allocation, or LDA for short, is used for topic modeling. Topic modeling is a textual analysis technique where you extract a set of topics from a corpus of text data. LDA learns these topics based on the probability distribution of the words in the corpus of text.

Since this is an unsupervised algorithm, there is no need to set a target variable. Also, the number of topics must be specified up-front, and you will have to analyze each topic to find its domain meaning.
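As an example, the following sketch trains the built-in LDA algorithm through the SageMaker Python SDK on a hypothetical bag-of-words matrix. The data here is random, just to show the expected shape; num_topics and the other values are placeholders:

import numpy as np
from sagemaker import LDA, Session, get_execution_role

session = Session()
role = get_execution_role()

# Hypothetical bag-of-words matrix: one row per document,
# one column per vocabulary term (word counts)
doc_term_matrix = np.random.randint(0, 5, size=(100, 500)).astype("float32")

lda = LDA(
    num_topics=10,                  # must be specified up-front
    role=role,
    instance_count=1,               # LDA trains on a single CPU instance
    instance_type="ml.c5.2xlarge",
    sagemaker_session=session,
)

# record_set uploads the matrix to S3 in the protobuf format LDA expects
records = lda.record_set(doc_term_matrix)
lda.fit(records, mini_batch_size=50)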

Neural Topic Model algorithm

Just like the LDA algorithm, the Neural Topic Model (NTM) also aims to extract topics from a corpus of data. However, the difference between LDA and NTM is their learning logic. While LDA learns from probability distributions of the words in the documents, NTM is built on top of neural networks.

The NTM network architecture has a bottleneck layer, which creates an embedding representation of the documents. This bottleneck layer contains all the necessary information to predict document composition, and its coefficients can be considered topics.
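A training sketch for NTM looks very similar to the LDA one; reusing the hypothetical bag-of-words matrix from before (all values are placeholders):

import numpy as np
from sagemaker import NTM, Session, get_execution_role

session = Session()
role = get_execution_role()

# Same hypothetical document-term count matrix as in the LDA example
doc_term_matrix = np.random.randint(0, 5, size=(100, 500)).astype("float32")

ntm = NTM(
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    num_topics=10,                  # still specified up-front
    sagemaker_session=session,
)

ntm.fit(ntm.record_set(doc_term_matrix), mini_batch_size=50)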

With that, you have completed this section on textual analysis. In the next section, you will learn about image processing algorithms.