Important note
The power and simplicity of BoW come from the fact that you can easily build a training set for your algorithms. If you look at Figure 4.11, can you see that having more data and simply adding a classification column to that table, such as good or bad review, would […]
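As a sketch of that idea, here is what adding a good/bad label column to a BoW matrix looks like in practice, assuming scikit-learn; the reviews and labels are invented for illustration, and the table from Figure 4.11 is not reproduced here:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented mini-dataset: text plus a "good"/"bad" classification column
reviews = ["great movie, loved it", "terrible plot, hated it",
           "loved the acting", "hated every minute of it"]
labels = ["good", "bad", "good", "bad"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)      # the BoW matrix: one row per review
clf = LogisticRegression().fit(X, labels)  # the label column turns it into a supervised problem

print(clf.predict(vectorizer.transform(["loved the movie"])))
```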
Word embedding
Unlike traditional approaches such as BoW and TF-IDF, modern methods of text representation capture the context in which words appear, not just their presence or frequency. One very popular and powerful approach that follows this concept is known as word embedding. Word embeddings create a dense vector of […]
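As a minimal sketch of training word embeddings, assuming the gensim library and a tiny invented corpus (real embeddings need far more text to be useful):

```python
from gensim.models import Word2Vec

# Toy, pre-tokenized corpus (purely illustrative)
sentences = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "is", "powerful"],
    ["machine", "learning", "uses", "data"],
]

# vector_size controls the length of the dense vector learned per word
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=42)

print(model.wv["learning"].shape)        # a dense 50-dimensional vector
print(model.wv.most_similar("machine"))  # words that appear in similar contexts
```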
Bag of words
The first technique you will learn is known as bag of words (BoW). This is a very common and simple technique, applied to text data, that creates matrix representations describing the frequency of each word within the text. BoW consists of two main steps: creating a vocabulary and creating a representation of […]
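To make those two steps concrete, here is a minimal sketch using scikit-learn's CountVectorizer, which builds the vocabulary and the count matrix in one call; the two toy documents are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I loved this movie", "I hated this movie"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)  # step 1: learn the vocabulary; step 2: count words

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # one row per document, one column per word
```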
Important note
The testing set must never be under- or oversampled: only the training set should pass through these resampling techniques. You can also oversample the training set by applying synthetic sampling techniques. Random oversampling does not add any new information to the training set: it just duplicates existing samples. By creating synthetic samples, you are deriving […]
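As a sketch of the difference between the two approaches, assuming the imbalanced-learn and scikit-learn libraries (the dataset is synthetic), note that only the training split is resampled:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced two-class dataset (90% / 10%)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Resample the TRAINING split only; the test split stays untouched
X_dup, y_dup = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)  # duplicates
X_syn, y_syn = SMOTE(random_state=42).fit_resample(X_train, y_train)              # synthetic samples

print(Counter(y_train), Counter(y_dup), Counter(y_syn))
```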
Dealing with unbalanced datasets
At this point, you might have realized why data preparation is probably the longest part of a data scientist's work. You have learned about data transformation, missing values, and outliers, but the list of problems goes on. Don't worry – you are on the right journey to master this topic! […]
Important note
In the case of categorical variables, you can replace the missing data with the value that occurs most frequently in your dataset (the mode). The same logic of grouping the dataset according to specific features is still applicable. You can also use more sophisticated methods of imputation, including constructing an ML model to predict […]
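Here is a minimal pandas sketch of both ideas on an invented toy table; the column names and the grouping feature are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "segment": ["A", "A", "A", "B", "B", "B"],
    "color":   ["red", np.nan, "red", "blue", "blue", np.nan],
})

# Global mode imputation: fill with the most frequent value overall
df["color_global"] = df["color"].fillna(df["color"].mode()[0])

# Mode imputation within each group (assumes no group is entirely missing)
df["color_grouped"] = df.groupby("segment")["color"].transform(
    lambda s: s.fillna(s.mode()[0])
)
print(df)
```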
Important note
Although, in real scenarios, you usually need to treat missing data via exclusion or imputation, never forget that you can always go back to the source process and check whether you can retrieve (or, at least, better understand) the missing data. You may face this option in the exam. If you don't […]
Handling missing values
As the name suggests, missing values refer to the absence of data. Such absences are usually represented by tokens, which may or may not be implemented in a standard way. Although the use of tokens is common practice, the tokens themselves vary across platforms. For example, relational databases represent missing […]
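For instance, in pandas (a common Python data library), both NaN and None are recognized as missing tokens; a minimal sketch with an invented table:

```python
import numpy as np
import pandas as pd

# Different "tokens" that pandas treats as missing
df = pd.DataFrame({
    "age":  [34, np.nan, 29],      # NaN: the numeric missing-value token
    "city": ["Rio", None, "NY"],   # None is also recognized as missing
})

print(df.isnull())        # Boolean mask of missing entries
print(df.isnull().sum())  # Missing count per column
```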
Important note
During your exam, if you see questions about the Box-Cox transformation, remember that it is a method that can perform many power transformations (controlled by a lambda parameter), and its end goal is to make the original distribution closer to a normal distribution. Just to conclude this discussion regarding why mathematical transformations can […]
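As an illustration, SciPy's boxcox can both apply the transformation and estimate the lambda that best normalizes the data; a minimal sketch on invented, strictly positive (log-normal) data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed, strictly positive

# With lmbda omitted, boxcox searches for the lambda that best
# normalizes the data (lambda = 0 corresponds to a log transform)
transformed, best_lambda = stats.boxcox(skewed)
print(f"estimated lambda: {best_lambda:.3f}")
```

Note that Box-Cox only works on strictly positive data, which is why the sketch draws from a log-normal distribution.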
Applying other types of numerical transformations
Normalization and standardization rely on your training data to fit their parameters: the minimum and maximum values in the case of normalization, and the mean and standard deviation in the case of standard scaling. This also means you must fit those parameters using only your training data and never the testing […]
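A minimal scikit-learn sketch of this rule, using randomly generated data: the scaler is fitted on the training split only and then applied to both splits (MinMaxScaler would be the analogous call for normalization):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=10, scale=3, size=(100, 2))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)    # learns mean/std from the TRAINING data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # the test set reuses the training parameters
```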