Visualizing distributions in your data – Data Understanding and Visualization – MLS-C01 Study Guide

Visualizing distributions in your data

Exploring the distribution of a feature helps you understand some of its key characteristics, such as its skewness, mean, median, and quantiles. A histogram makes skewness easy to see: this type of chart groups your data into bins (or buckets) and counts the observations that fall into each one. For example, Figure 5.7 shows a histogram for the age variable:

Figure 5.7 – Plotting distributions with a histogram

Looking at the histogram, you could conclude that most of the people are between 20 and 50 years old. You can also see a few people more than 60 years old. Another example of a histogram is shown in Figure 5.8, which plots the distribution of payments from a particular event that has different ticket prices. It aims to analyze how much money people are paying per ticket.

Figure 5.8 – Checking skewness with a histogram

Here, you can see that most people pay at most 100 dollars per ticket. The bulk of the data sits on the left, while the less frequent, more expensive tickets stretch the tail out to the right-hand side. That is why the distribution is skewed to the right.
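To make the binning and the skewness check concrete, here is a minimal sketch that uses only the Python standard library. The ticket prices are made up for illustration and do not come from Figure 5.8, and `histogram_counts` is a hypothetical helper, not a library function.

```python
from collections import Counter
from statistics import mean, median


def histogram_counts(values, bin_width):
    """Count how many values fall into each fixed-width bin.

    Returns a dict that maps the lower edge of each non-empty bin
    to the number of values in that bin.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Made-up ticket prices in dollars (illustrative only).
ticket_prices = [30, 45, 50, 60, 75, 80, 90, 95, 100, 100, 250, 400]

counts = histogram_counts(ticket_prices, bin_width=100)

# Print a rough text histogram: one '#' per observation in the bin.
for lower_edge, count in counts.items():
    print(f"[{lower_edge}, {lower_edge + 100}): {'#' * count}")

# In a right-skewed distribution, the long tail typically pulls
# the mean above the median.
print("mean:", mean(ticket_prices), "median:", median(ticket_prices))
```

Bins with no observations are simply absent from the result. A plotting library would normally draw them as gaps between the bars.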

If you want to see other characteristics of the distribution, such as its median, quantiles, and outliers, then you should use box plots. Figure 5.9 shows another performance comparison of different algorithms on a given dataset.

These algorithms were executed many times, in a cross-validation process, which resulted in multiple outputs for the same algorithm; for example, one accuracy metric for each execution of the algorithm on each fold.

Since there are multiple accuracy metrics for each algorithm, you can use a box plot to check how each of those algorithms performed during the cross-validation process.

Figure 5.9 – Plotting distributions with a box plot

Here, you can see that box plots can present some information about the distribution of the data, such as its median, lower quartile, upper quartile, and outliers. For a complete understanding of each element of a box plot, take a look at Figure 5.10.

Figure 5.10 – Box plot elements
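The elements of a box plot can also be computed by hand. The sketch below assumes the common Tukey convention, in which whiskers extend to the most extreme points within 1.5 times the interquartile range and anything beyond them is an outlier. Figure 5.10 may follow a different convention. The accuracy scores are made up, and `box_plot_summary` is a hypothetical helper.

```python
from statistics import quantiles


def box_plot_summary(values, whisker_factor=1.5):
    """Compute the main elements of a box plot.

    Assumes the Tukey rule: points farther than whisker_factor * IQR
    from the box are reported as outliers.
    """
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr

    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]

    return {
        "lower_quartile": q1,
        "median": med,
        "upper_quartile": q3,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": outliers,
    }


# Made-up accuracy scores: nine runs near 0.80 and one unusually good run.
scores = [0.78, 0.79, 0.79, 0.80, 0.80, 0.80, 0.81, 0.81, 0.82, 0.92]

summary = box_plot_summary(scores)
for name, value in summary.items():
    print(name, value)
```

With this data, the single 0.92 run lands outside the upper whisker and is reported as an outlier, while the whiskers stop at the lowest and highest of the remaining runs.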

By analyzing the box plot shown in Figure 5.9, you could conclude that the AdaBoost algorithm (labeled ADA) produced an outlier during the cross-validation process, since one of the executions resulted in a very good model (around 92% accuracy). All the other executions of AdaBoost resulted in less than 85% accuracy, with a median value of around 80%.

Another conclusion you could make after analyzing Figure 5.9 is that the CART algorithm presented the poorest performance during the cross-validation process (the lowest median and lower quartile).
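The same comparison can be made numerically by summarizing the per-fold scores of each algorithm and ranking them by median. This is a minimal sketch with made-up numbers: the ADA and CART labels are borrowed from the text, OTHER is a hypothetical third model, and none of the values are read from Figure 5.9.

```python
from statistics import quantiles

# Made-up per-fold accuracies (illustrative only).
cv_scores = {
    "ADA": [0.78, 0.79, 0.80, 0.81, 0.92],
    "CART": [0.70, 0.72, 0.73, 0.75, 0.76],
    "OTHER": [0.80, 0.82, 0.83, 0.84, 0.85],
}


def summarize(scores):
    """Return the quartiles that a box plot would draw for one algorithm."""
    q1, med, q3 = quantiles(scores, n=4, method="inclusive")
    return {"lower_quartile": q1, "median": med, "upper_quartile": q3}


summaries = {name: summarize(scores) for name, scores in cv_scores.items()}

# The algorithm with the lowest median is the weakest by this criterion.
worst = min(summaries, key=lambda name: summaries[name]["median"])

for name, stats in summaries.items():
    print(name, stats)
print("lowest median:", worst)
```

Ranking by median rather than by mean keeps a single unusually good or bad fold from dominating the comparison.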

Before you wrap up this section, note that you can also use a scatter plot to analyze data distributions when you have more than one variable. Next, you will look at another set of charts that is useful for showing compositions in your data.

Visualizing compositions in your data

Sometimes, you want to analyze the various elements that compose a feature, for example, the percentage of sales per region or the percentage of queries per channel. Neither example considers a time dimension; both look at the data as a whole. For these types of compositions, where you don’t have the time dimension, you could show your data using pie charts, stacked 100% bar charts, and tree maps.

Figure 5.11 is a pie chart showing the number of queries per customer channel for a given company over a pre-defined period of time.

Figure 5.11 – Plotting compositions with a pie chart
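The slices of a pie chart are simply each category's share of the total. The sketch below shows that calculation; the channel names and counts are made up and are not taken from Figure 5.11, and `composition_percentages` is a hypothetical helper.

```python
def composition_percentages(counts):
    """Convert raw counts per category into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute a composition for an empty total")
    return {category: 100 * count / total for category, count in counts.items()}


# Made-up number of queries per customer channel (illustrative only).
queries_per_channel = {"phone": 500, "email": 300, "chat": 200}

shares = composition_percentages(queries_per_channel)
for channel, share in shares.items():
    print(f"{channel}: {share:.1f}%")
```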

If you want to show compositions while considering a time dimension, then your most common options are a stacked area chart, a stacked 100% area chart, a stacked column chart, or a stacked 100% column chart. For reference, take a look at Figure 5.12, which shows the sales per region from 2016 until 2020.

Figure 5.12 – Plotting compositions with a stacked 100% column chart

As you can see, stacked 100% column charts help you understand how compositions change across different periods.
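Behind a stacked 100% column chart, each period is normalized on its own so that every column adds up to 100%. The following sketch illustrates that step. The region names and sales figures are made up and only the 2016 to 2020 range comes from the text; `normalize_by_period` is a hypothetical helper.

```python
def normalize_by_period(sales_by_period):
    """Rescale each period so that its categories sum to 100%."""
    normalized = {}
    for period, sales in sales_by_period.items():
        total = sum(sales.values())
        if total == 0:
            raise ValueError(f"no sales recorded for period {period}")
        normalized[period] = {
            region: 100 * amount / total for region, amount in sales.items()
        }
    return normalized


# Made-up sales per region and year (illustrative only).
sales_by_year = {
    2016: {"North": 40, "South": 60},
    2017: {"North": 50, "South": 50},
    2018: {"North": 90, "South": 60},
    2019: {"North": 120, "South": 80},
    2020: {"North": 150, "South": 50},
}

shares_by_year = normalize_by_period(sales_by_year)
for year, shares in shares_by_year.items():
    print(year, {region: round(share, 1) for region, share in shares.items()})
```

Because each column is rescaled independently, the chart shows how the mix shifts over time but hides changes in total sales, so a stacked column chart with absolute values is the better choice when the totals also matter.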