
Important note

The testing set must not be under- or oversampled: only the training set should pass through these resampling techniques.

You can also oversample the training set by applying synthetic sampling techniques. Random oversampling does not add any new information to the training set: it just duplicates existing observations. By creating synthetic samples instead, you derive new observations from the existing ones (rather than simply copying them). This is a type of data augmentation technique known as the Synthetic Minority Oversampling Technique (SMOTE).

Technically, what SMOTE does is draw a line segment in the feature space between a minority-class observation and one of its nearest minority-class neighbors, and then create synthetic points along that segment.
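
Before moving on, here is a minimal sketch of the train-only resampling rule and SMOTE in action, assuming the imbalanced-learn library (imported as imblearn) and a synthetic dataset created purely for illustration:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Build an imbalanced binary dataset (roughly a 90%/10% class split)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Split first, so the testing set is never touched by resampling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Oversample ONLY the training set: SMOTE creates synthetic minority
# samples by interpolating between existing minority-class neighbors
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print(sorted(Counter(y_train_res).items()))  # classes are now balanced
```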

Important note

You may find questions on your exam that use the term SMOTE. If that happens, keep in mind the context in which this term is applied: oversampling.

Alright – in the next section, you will learn how to prepare text data for ML models.

Dealing with text data

You have already learned how to transform categorical features into numerical representations, using label encoders, ordinal encoders, or one-hot encoding. However, what if your dataset has fields containing long pieces of text? How are you supposed to give them a mathematical representation so that they can properly feed ML algorithms? This is a common issue in Natural Language Processing (NLP), a subfield of AI.
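
As a preview of what such a representation can look like, here is a minimal sketch of the bag-of-words idea using scikit-learn's CountVectorizer; the tiny corpus is made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three short "documents" standing in for free-text fields
corpus = [
    "I loved this product",
    "I hated this product",
    "this product is okay",
]

# Learn the vocabulary and count word occurrences per document
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # word counts per document
```

Each piece of text becomes a numerical vector of word counts, which any ML algorithm can consume.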

NLP models aim to extract knowledge from texts; for example, translating text between languages, identifying entities in a corpus of text (also known as Named Entity Recognition, or NER for short), classifying the sentiment of a user review, and many other applications.

Important note

In Chapter 8, AWS Application Services for AI/ML, you will learn about some AWS application services that apply NLP to their solutions, such as Amazon Translate and Amazon Comprehend. During the exam, you might be asked to think about the fastest or easiest way (with the least development effort) to build certain types of NLP applications. The fastest or easiest way is usually to use those out-of-the-box AWS services, since they offer pre-trained models for some use cases (especially machine translation, sentiment analysis, topic modeling, document classification, and entity recognition).
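
To make the "least development effort" idea concrete, here is a hedged sketch of calling the pre-trained sentiment analysis model in Amazon Comprehend via boto3; it assumes your AWS credentials and region are already configured:

```python
import boto3

# Amazon Comprehend exposes pre-trained NLP models behind a simple API
comprehend = boto3.client("comprehend")

response = comprehend.detect_sentiment(
    Text="I really enjoyed studying for this exam!",
    LanguageCode="en",
)

print(response["Sentiment"])       # e.g. POSITIVE
print(response["SentimentScore"])  # confidence scores per sentiment class
```

No model training or hosting is involved, which is exactly why such services count as the fastest option on the exam.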

In a few chapters’ time, you will also learn about some built-in AWS algorithms for NLP applications, such as BlazingText, Latent Dirichlet Allocation (LDA), Neural Topic Modeling (NTM), and the Sequence-to-Sequence algorithm. These algorithms let you build the same kinds of NLP solutions as those out-of-the-box services; however, you have to run them on SageMaker and write your own solution. In other words, they offer more flexibility but demand more development effort.
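
For contrast with the out-of-the-box services, here is a hedged sketch of launching the built-in BlazingText algorithm with the SageMaker Python SDK (v2); the S3 bucket and paths are placeholders, and the training data is assumed to already be in the format BlazingText expects:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # works inside SageMaker notebooks

# Look up the container image for the built-in BlazingText algorithm
container = image_uris.retrieve("blazingtext", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    output_path="s3://my-bucket/blazingtext/output",  # placeholder
    sagemaker_session=session,
)
estimator.set_hyperparameters(mode="supervised", epochs=10)

# Train on labeled text already uploaded to S3 (placeholder path)
estimator.fit({"train": "s3://my-bucket/blazingtext/train"})
```

Here, you manage the container, instance type, hyperparameters, and data format yourself, which is the extra development effort the exam contrasts against the managed services.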

Keep that in mind for your exam!

Although AWS offers many out-of-the-box services and built-in algorithms that allow you to create NLP applications, you will not look at those AWS product features now (you will do so in Chapter 6, Applying Machine Learning Algorithms, and Chapter 8, AWS Application Services for AI/ML). You will finish this chapter by looking at some data preparation techniques that are extremely important for preparing your data for NLP.