Another unsupervised algorithm that AWS provides in its list of built-in algorithms is principal component analysis, or PCA for short. PCA is a technique used to reduce the number of variables/dimensions in a dataset.
The main idea behind PCA is projecting the data points onto another set of coordinates, known as Principal Components (PCs), which aim to explain the most variance in the data. By definition, the first component captures more variance than the second, the second captures more than the third, and so on.
You can set up as many PCs as you need, as long as their number does not exceed the number of variables in your dataset. Figure 6.18 shows how these PCs are drawn:
Figure 6.18 – Finding PCs in PCA
As mentioned previously, the first PC will be drawn in such a way that it will capture most of the variance in the data. That is why it passes near the majority of the data points in Figure 6.18.
Then, the second PC will be perpendicular to the first one, so that it becomes the second-best component at explaining the variance in the data. If you want to create more components (consequently capturing more variance), you just have to follow the same rule of adding perpendicular components. Under the hood, the PCs come from linear algebra: they are the eigenvectors of the data's covariance matrix, and the associated eigenvalues measure how much variance each component explains.
So, what is the story with dimensionality reduction here? In case it is not clear yet, these PCs can be used to replace your original variables. For example, suppose you have 10 variables in your dataset and you want to reduce it to three variables that best represent the others. A potential solution would be applying PCA and extracting the first three PCs!
Do these three components explain 100% of the variance in your dataset? Probably not, but ideally, they will explain most of it. Adding more PCs will explain more variance, but at the cost of adding extra dimensions.
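To make this concrete, here is a minimal sketch of PCA from scratch with NumPy on synthetic data (the dataset, shapes, and variable names are illustrative, not from the book): it computes the eigenvectors of the covariance matrix, ranks them by eigenvalue, and projects a 10-variable dataset onto its first three PCs:

```python
import numpy as np

# Synthetic dataset: 500 observations, 10 variables (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=500)  # add some correlation

# 1. Center the data (PCA works on zero-mean variables)
X_centered = X - X.mean(axis=0)

# 2. Eigendecomposition of the covariance matrix
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort components by explained variance, descending
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Keep the first three PCs and project the data onto them
n_components = 3
X_reduced = X_centered @ eigenvectors[:, :n_components]  # shape (500, 3)

# Fraction of the total variance explained by the three PCs
explained = eigenvalues[:n_components].sum() / eigenvalues.sum()
print(f"Variance explained by 3 PCs: {explained:.1%}")
```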
Using AWS’s built-in algorithm for PCA
In AWS, PCA works in two different modes:

- Regular: Suitable for sparse data and datasets with a moderate number of observations and features
- Randomized: Suitable for datasets with a large number of observations and features

The difference is that, in randomized mode, an approximation algorithm is used.
Of course, the main hyperparameter of PCA is the number of components that you want to extract, which is set via num_components.
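A minimal training sketch with the SageMaker Python SDK (v2) might look as follows; the role ARN, instance type, and training data here are placeholders you would replace with your own:

```python
import numpy as np
import sagemaker
from sagemaker import PCA

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder role ARN

# Built-in PCA estimator: extract 3 components in randomized mode
pca = PCA(
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    num_components=3,
    algorithm_mode="randomized",  # or "regular"
    sagemaker_session=session,
)

# Illustrative training data: 500 observations, 10 features
train_data = np.random.rand(500, 10).astype("float32")

# record_set() uploads the data to S3 in RecordIO-protobuf format
records = pca.record_set(train_data)
pca.fit(records)
```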
IP Insights
IP Insights is an unsupervised algorithm that is used for pattern recognition. Essentially, it learns the usage patterns of IPv4 addresses.
The modus operandi of this algorithm is very intuitive: it is trained on pairs of events in the format of (entity, IPv4 address) so that it can learn the usage pattern of each entity it was trained on.
Important note
For instance, you can think of an "entity" as a user ID or account number.
Then, to make predictions, it receives events with the same data structure (entity, IPv4 address) and returns an anomaly score for that particular IP address, according to the input entity.
Important note
The anomaly score returned by IP Insights indicates how anomalous the pattern of the event is.
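Here is a sketch of training the built-in IP Insights algorithm with the SageMaker SDK, assuming the training data is a headerless CSV of entity, IPv4 address pairs already uploaded to S3 (the bucket name, role ARN, and hyperparameter values are placeholders):

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder

# Resolve the built-in IP Insights container image for your region
image = image_uris.retrieve("ipinsights", session.boto_region_name)

estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    sagemaker_session=session,
)

# Core hyperparameters (values are illustrative)
estimator.set_hyperparameters(
    num_entity_vectors=10000,  # hash size for entity IDs
    vector_dim=128,            # size of the learned embeddings
    epochs=5,
)

# Headerless CSV rows such as: user_1,192.0.2.10
train_input = TrainingInput(
    "s3://my-bucket/ipinsights/train.csv",  # placeholder S3 path
    content_type="text/csv",
)
estimator.fit({"train": train_input})
```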
You might come across many applications for IP Insights. For example, you can create an IP Insights model that is trained on your application's login events (here, the user ID is your entity). You can then expose this model through an API endpoint to make predictions in real time.
Then, during the authentication process of your application, you could call your endpoint and pass the IP address that is trying to log in. If you get a high anomaly score (meaning this pattern of logging in looks anomalous), you can request extra information before authorizing access (even if the password is correct).
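A minimal sketch of that real-time check with boto3 follows (the endpoint name and threshold are placeholders). Note that the raw endpoint response contains a dot_product score, where lower values suggest a more unusual entity/IP pairing, so negating it yields an anomaly score in which higher means more anomalous:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def anomaly_score(user_id: str, ip_address: str) -> float:
    """Query a deployed IP Insights endpoint for a login event."""
    response = runtime.invoke_endpoint(
        EndpointName="ipinsights-login-endpoint",  # placeholder name
        ContentType="text/csv",
        Body=f"{user_id},{ip_address}",
    )
    result = json.loads(response["Body"].read())
    # IP Insights returns a dot product; low/negative values mean the
    # (entity, IP) pair is unusual, so negate it to get an anomaly score
    return -result["predictions"][0]["dot_product"]

# Illustrative usage during authentication
THRESHOLD = 0.0  # placeholder; tune against known-good and simulated traffic
if anomaly_score("user_1", "198.51.100.7") > THRESHOLD:
    print("Suspicious login: trigger extra verification (for example, MFA)")
```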
This is just one of the many applications of IP Insights that you can think of. Next, you will learn about textual analysis.