Modern applications use Natural Language Processing (NLP) for several purposes, such as text translation, document classification, web search, Named Entity Recognition (NER), and many others.
AWS offers a suite of algorithms for most NLP use cases. In the next few subsections, you will have a look at these built-in algorithms for textual analysis.
BlazingText performs two different types of task: text classification, a supervised learning approach that extends the fastText text classifier, and Word2Vec, an unsupervised learning algorithm that generates word embeddings.
BlazingText’s implementations of these two algorithms are optimized to run on large datasets. For example, you can train a model on top of billions of words in a few minutes.
This scalability is possible, in part, because the Word2Vec option supports batch_skipgram mode, which allows BlazingText to distribute training across multiple CPU nodes.
Important note
The distributed training that’s performed by BlazingText uses a mini-batching approach to convert level-1 BLAS (Basic Linear Algebra Subprograms) operations into level-3 BLAS operations. If you see these terms during your exam, you should know that they are related to BlazingText (Word2Vec mode).
Also in Word2Vec mode, BlazingText supports the skip-gram and Continuous Bag of Words (CBOW) architectures in addition to batch_skipgram.
Finally, make sure you are familiar with the main configurations (training modes and hyperparameters) of BlazingText, since they are likely to be present in your exam.
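To make this concrete, here is a minimal sketch of how a BlazingText training job in Word2Vec mode could be configured with the SageMaker Python SDK. The IAM role ARN, S3 paths, and hyperparameter values below are placeholders, not recommendations:

```python
import sagemaker
from sagemaker.estimator import Estimator

# Minimal sketch using the SageMaker Python SDK (v2); role ARN and S3 paths are placeholders.
session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # hypothetical execution role

# Retrieve the BlazingText container image for the current region
image_uri = sagemaker.image_uris.retrieve("blazingtext", session.boto_region_name)

blazingtext = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=2,                  # more than one instance: distributed CPU training
    instance_type="ml.c5.2xlarge",
    output_path="s3://my-bucket/blazingtext/output",
    sagemaker_session=session,
)

# batch_skipgram is the Word2Vec mode that supports distributed training across CPUs
blazingtext.set_hyperparameters(
    mode="batch_skipgram",
    vector_dim=100,    # dimensionality of the word embeddings
    epochs=5,
    min_count=5,       # ignore words that appear fewer than 5 times
    window_size=5,
)

# The train channel expects a preprocessed text corpus (one sentence per line)
blazingtext.fit({"train": "s3://my-bucket/blazingtext/train"})
```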
Sequence-to-sequence (Seq2Seq) is a supervised algorithm that transforms an input sequence into an output sequence. The sequence can be a text sentence or even an audio recording.
The most common use cases for sequence-to-sequence are machine translation, text summarization, and speech-to-text; in general, any problem that can be framed as mapping one sequence to another can be approached with this algorithm.
Technically, Amazon SageMaker's Seq2Seq builds its models on two types of neural network: Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) with an attention mechanism.
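As a hedged illustration of how such a job might look for a machine translation task (again, the role ARN, bucket, and hyperparameter values are placeholders):

```python
import sagemaker
from sagemaker.estimator import Estimator

# Minimal sketch; the role ARN, bucket, and hyperparameter values are placeholders.
session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # hypothetical execution role

seq2seq = Estimator(
    image_uri=sagemaker.image_uris.retrieve("seq2seq", session.boto_region_name),
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",     # Seq2Seq training runs on GPU instances
    output_path="s3://my-bucket/seq2seq/output",
    sagemaker_session=session,
)

# Illustrative settings for a short-sentence translation task
seq2seq.set_hyperparameters(
    num_layers_encoder=1,
    num_layers_decoder=1,
    max_seq_len_source=60,
    max_seq_len_target=60,
)

# Training expects RecordIO-Protobuf data plus vocabulary files in separate channels
seq2seq.fit({
    "train": "s3://my-bucket/seq2seq/train",
    "validation": "s3://my-bucket/seq2seq/validation",
    "vocab": "s3://my-bucket/seq2seq/vocab",
})
```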
Latent Dirichlet Allocation, or LDA for short, is used for topic modeling. Topic modeling is a textual analysis technique that extracts a set of topics from a corpus of text data. LDA learns these topics based on the probability distribution of the words in the corpus.
Since this is an unsupervised algorithm, there is no need to set a target variable. Also, the number of topics must be specified up-front, and you will have to analyze each topic to find its domain meaning.
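To illustrate the idea at the concept level (using scikit-learn locally here rather than the SageMaker built-in LDA algorithm, and a made-up toy corpus), notice how the number of topics is fixed up front and the resulting topics still need human interpretation:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus, just to show the workflow
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "stock markets fell sharply today",
    "investors worry about market volatility and stocks",
]

# LDA works on word counts, so turn the documents into a document-term matrix
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Unsupervised: no target variable, but the number of topics is set up front
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Inspect the highest-weighted words per topic to assign a domain meaning yourself
words = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-3:][::-1]]
    print(f"Topic {idx}: {top_words}")
```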
Just like the LDA algorithm, the Neural Topic Model (NTM) also aims to extract topics from a corpus of data. However, the difference between LDA and NTM is their learning logic. While LDA learns from probability distributions of the words in the documents, NTM is built on top of neural networks.
The NTM network architecture has a bottleneck layer, which creates an embedding representation of the documents. This bottleneck layer contains all the necessary information to predict document composition, and its coefficients can be considered topics.
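As a rough sketch (the role ARN, bucket, topic count, and vocabulary size are placeholders), training the built-in NTM algorithm follows the same pattern as the other built-ins:

```python
import sagemaker
from sagemaker.estimator import Estimator

# Minimal sketch; the role ARN, bucket, and values below are placeholders.
session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # hypothetical execution role

ntm = Estimator(
    image_uri=sagemaker.image_uris.retrieve("ntm", session.boto_region_name),
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path="s3://my-bucket/ntm/output",
    sagemaker_session=session,
)

# As with LDA, the number of topics is chosen up front;
# feature_dim must match the vocabulary size used to vectorize the documents.
ntm.set_hyperparameters(num_topics=20, feature_dim=5000)

# The train channel expects vectorized documents (for example, RecordIO-Protobuf or CSV)
ntm.fit({"train": "s3://my-bucket/ntm/train"})
```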
With that, you have completed this section on textual analysis. In the next section, you will learn about image processing algorithms.