Comparisons are very common in data analysis, and there are several ways to present them. The bar chart is a good place to start, since you have probably seen it in many reports.
Bar charts can be used to compare one variable among different classes – for example, a car’s price across different models or population size per country. In Figure 5.3, a bar chart compares the percentage of positive COVID-19 tests across several regions of India as of April 7th, 2020.
Figure 5.3 – Plotting comparisons with a bar chart (source: State Health Department of India)
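As a rough illustration of this kind of comparison, here is a minimal matplotlib sketch. The region names and percentages are hypothetical placeholders, not the values behind Figure 5.3, and the output filename is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders to a file, no display needed
import matplotlib.pyplot as plt

# Hypothetical values for illustration only (not the data shown in Figure 5.3).
positive_rate_pct = {"Region A": 4.2, "Region B": 2.7, "Region C": 6.1, "Region D": 1.5}

# Sort in descending order so the ranking between classes is easy to read.
ordered = sorted(positive_rate_pct.items(), key=lambda item: item[1], reverse=True)
regions = [name for name, _ in ordered]
rates = [value for _, value in ordered]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(regions, rates, color="steelblue")  # one bar per class
ax.set_xlabel("Region")
ax.set_ylabel("Positive tests (%)")
ax.set_title("Positive test rate by region (illustrative data)")
fig.tight_layout()
fig.savefig("bar_chart.png")
```

Sorting the classes before plotting is a design choice rather than a requirement; it tends to help when the goal is to rank classes against each other.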
Sometimes, you can also use stacked bar charts to add another dimension to the data that is being analyzed. For example, Figure 5.4 uses a stacked bar chart to show how many people were on board the Titanic by sex. Additionally, it breaks down the number of people who survived (positive class) and those who did not (negative class), also by sex.
Figure 5.4 – Using a stacked bar chart to analyze the Titanic disaster dataset
As you can see, most of the women survived the disaster, while most of the men did not. The stacked bars help us visualize the difference between the fates of the sexes. Finally, you should know that you can also show percentages on those stacked bars, not just absolute numbers.
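The following sketch shows one way to draw stacked bars with matplotlib, first with absolute numbers and then with percentages. The counts are assumed, illustrative values chosen only to mirror the pattern described above; they are not read from Figure 5.4.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders to a file, no display needed
import matplotlib.pyplot as plt

# Illustrative counts only (assumed, not taken from Figure 5.4).
sexes = ["female", "male"]
survived = [230, 110]      # positive class
not_survived = [80, 470]   # negative class

# Convert each segment to a percentage of its own bar (its sex group).
totals = [s + n for s, n in zip(survived, not_survived)]
survived_pct = [100 * s / t for s, t in zip(survived, totals)]
not_survived_pct = [100 * n / t for n, t in zip(not_survived, totals)]

fig, (ax_counts, ax_pct) = plt.subplots(1, 2, figsize=(9, 4))

# Absolute numbers: the second series starts where the first one ends (bottom=...).
ax_counts.bar(sexes, survived, label="Survived")
ax_counts.bar(sexes, not_survived, bottom=survived, label="Did not survive")
ax_counts.set_ylabel("Passengers")
ax_counts.set_title("Absolute numbers")
ax_counts.legend()

# Percentages: every bar sums to 100, so only the proportions are compared.
ax_pct.bar(sexes, survived_pct, label="Survived")
ax_pct.bar(sexes, not_survived_pct, bottom=survived_pct, label="Did not survive")
ax_pct.set_ylabel("Share of passengers (%)")
ax_pct.set_title("Percentages")
ax_pct.legend()

fig.tight_layout()
fig.savefig("stacked_bar_chart.png")
```

The percentage version hides the difference in group sizes, so it is usually worth deciding up front whether the audience needs the volumes, the proportions, or both.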
Column charts are also useful if you need to compare one or two variables across different periods. For example, in Figure 5.5, you can see the annual Canadian electric vehicle sales by province.
Figure 5.5 – Plotting comparisons with a column chart (source: https://electrek.co/)
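A chart like this can be approximated with grouped columns, where each period gets one column per class. The sketch below assumes hypothetical province names and sales figures; none of them come from Figure 5.5.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders to a file, no display needed
import matplotlib.pyplot as plt

# Hypothetical annual sales per province (illustrative values only).
years = ["2017", "2018", "2019"]
sales_by_province = {
    "Province A": [4000, 9000, 12000],
    "Province B": [3000, 7000, 10000],
    "Province C": [1000, 2500, 4000],
}

n_series = len(sales_by_province)
bar_width = 0.8 / n_series  # keep a gap between consecutive year groups
x_positions = list(range(len(years)))

fig, ax = plt.subplots(figsize=(7, 4))
for i, (province, sales) in enumerate(sales_by_province.items()):
    # Shift each province's columns sideways so they sit next to each other.
    offsets = [x + i * bar_width for x in x_positions]
    ax.bar(offsets, sales, width=bar_width, label=province)

# Place each year label under the middle of its group of columns.
group_centers = [x + bar_width * (n_series - 1) / 2 for x in x_positions]
ax.set_xticks(group_centers)
ax.set_xticklabels(years)
ax.set_xlabel("Year")
ax.set_ylabel("Vehicles sold")
ax.legend()
fig.tight_layout()
fig.savefig("column_chart.png")
```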
Another very effective way to make comparisons across different periods is by using line charts. Figure 5.6 shows a pretty interesting example of how you can compare different algorithms’ performance, in a particular project, across different release dates.
Important note
Line charts are usually very helpful to indicate whether there is any trend in the data over the period of time under analysis. A very common use case for line charts is forecasting, where you usually have to analyze trends and seasonality in time series data.
For example, in Figure 5.6, you can see that the Classification and Regression Trees (CART) model used to be the worst-performing model compared to the other algorithms, such as AdaBoost (ADA), gradient boosting (GB), random forest (RF), and logistic regression (LOGIT).
However, in July, the CART model was optimized, and it became the third-best model overall. The whole story about the best model for each period can easily be seen in Figure 5.6.
Figure 5.6 – Plotting comparisons with a line chart
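To make the idea concrete, the sketch below draws one line per model across release dates. The release labels and scores are hypothetical; they were picked only to reproduce the story told above (CART starting as the weakest model and ranking third after July) and are not the values plotted in Figure 5.6.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders to a file, no display needed
import matplotlib.pyplot as plt

# Hypothetical scores per release (illustrative values only).
releases = ["May", "Jun", "Jul", "Aug"]
scores = {
    "CART": [0.70, 0.71, 0.80, 0.81],
    "ADA": [0.78, 0.79, 0.79, 0.80],
    "GB": [0.84, 0.85, 0.85, 0.86],
    "RF": [0.82, 0.83, 0.84, 0.84],
    "LOGIT": [0.75, 0.75, 0.76, 0.76],
}

fig, ax = plt.subplots(figsize=(7, 4))
for model, values in scores.items():
    # One line per model; markers make the individual releases visible.
    ax.plot(releases, values, marker="o", label=model)
ax.set_xlabel("Release")
ax.set_ylabel("Model score")
ax.set_title("Model performance across releases (illustrative data)")
ax.legend()
fig.tight_layout()
fig.savefig("line_chart.png")

# Rank the models at the July release, best first, to confirm what the chart shows.
july = releases.index("Jul")
july_ranking = sorted(scores, key=lambda model: scores[model][july], reverse=True)
print(july_ranking)
```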
Finally, you can also show comparisons of your data using tables. Tables are especially useful when you have multiple dimensions (usually placed in the rows of the table) and one or more metrics to compare (usually placed in the columns of the table).
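As a small sketch of that layout, the pandas snippet below aggregates hypothetical sales records into a table with two dimensions in the rows and the metrics in the columns. The column names and values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical records for illustration only.
records = [
    {"region": "North", "product": "A", "units": 120, "revenue": 2400.0},
    {"region": "North", "product": "A", "units": 30, "revenue": 600.0},
    {"region": "North", "product": "B", "units": 80, "revenue": 2000.0},
    {"region": "South", "product": "A", "units": 100, "revenue": 2100.0},
    {"region": "South", "product": "B", "units": 60, "revenue": 1500.0},
]
df = pd.DataFrame(records)

# Dimensions (region, product) become the rows; metrics become the columns.
table = df.groupby(["region", "product"])[["units", "revenue"]].sum()

# A derived metric is often easier to compare than the raw totals.
table["revenue_per_unit"] = table["revenue"] / table["units"]

print(table.to_string())
```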
In the next section, you will learn about another set of charts that aims to show the distribution of your variables. This set of charts is particularly important for modeling tasks, since you must know the distribution of a feature to think about potential data transformations for it.