The Evolution of a ggplot (Ep. 1)

Posted by Cédric on Friday, May 17, 2019

🏁 Aim of this Tutorial

In this series of blog posts, I aim to show you how to turn a default ggplot into a plot that visualizes information in an appealing and easily understandable way. The goal of each blog post is to provide a step-by-step tutorial explaining how my visualizations have evolved from a typical basic ggplot. All plots are going to be created with 100% {ggplot2} and 0% Inkscape.

In the first episode, I transform a basic boxplot into a colorful and self-explanatory combination of a jittered dot strip plot and a lollipop plot. I am going to use data provided by the UNESCO on global student to teacher ratios that was selected as data for the #TidyTuesday challenge 19 of 2019.

🗃️ Data Preparation

I first prepared the data to map each country's most recently reported student-teacher ratio in primary education as a tile map. I used the tile-based world data provided by Maarten Lambrechts to create this map as the first visualization for my weekly contribution:

For the second chart next to the tile map, I wanted to highlight the differences in the mean student-teacher ratio per continent without discarding the raw data at the country level. Therefore, I transformed the information on the region to represent the six continents excluding Antarctica (hm, do penguins not go to school?! Seems so… 🐧) and merged both datasets. If you would like to run the code yourself, you can find the data preparation steps here. This is what the relevant columns of the merged and cleaned dataset look like, showing two examples per continent:

## # A tibble: 12 x 5
##    indicator     country              region    student_ratio student_ratio_reg~
##    <chr>         <chr>                <chr>             <dbl>              <dbl>
##  1 Primary Educ~ Lesotho              Africa             32.9               37.3
##  2 Primary Educ~ South Africa         Africa             30.3               37.3
##  3 Primary Educ~ Bangladesh           Asia               30.1               20.7
##  4 Primary Educ~ Viet Nam             Asia               19.6               20.7
##  5 Primary Educ~ Ireland              Europe             16.1               13.6
##  6 Primary Educ~ France               Europe             18.2               13.6
##  7 Primary Educ~ Saint Vincent and t~ North Am~          14.4               17.7
##  8 Primary Educ~ Dominican Republic   North Am~          18.9               17.7
##  9 Primary Educ~ Vanuatu              Oceania            26.6               24.7
## 10 Primary Educ~ Solomon Islands      Oceania            25.8               24.7
## 11 Primary Educ~ Argentina            South Am~          NA                 19.4
## 12 Primary Educ~ Paraguay             South Am~          24.2               19.4

🌱 The Default Boxplot

I was particularly interested in visualizing the most recent student-teacher ratio in primary education as a tile choropleth map per country. A common way of representing several data points per group is to use a boxplot:

library(tidyverse)

ggplot(df_ratios, aes(x = region, y = student_ratio)) +
  geom_boxplot()

🔀 Sort Your Data!

A good routine with this kind of data (qualitative and unsorted) is to arrange the boxplots, or any other chart type such as bars or violins, in increasing or decreasing order to improve readability. Since the category “continent” does not have an intrinsic order, I rearrange the boxplots by their mean student-teacher ratio instead of sorting them alphabetically, which is the default:

df_sorted <- 
  df_ratios %>%
  mutate(region = fct_reorder(region, -student_ratio_region))

ggplot(df_sorted, aes(x = region, y = student_ratio)) +
  geom_boxplot()

💡 Sort your data according to the best or worst, highest or lowest value to make your graph easily readable. Do not sort them if the categories have an intrinsic logical order, such as age groups!
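To see what the reordering does to the factor levels, here is a minimal sketch on a toy data frame. The three regions and their averages are copied from the table above, but the object names (toy, alphabetical, sorted) are made up for this illustration:

```r
library(forcats)

# Toy rows based on the table above (three regions only, illustration data)
toy <- data.frame(
  region = c("Africa", "Africa", "Europe", "Europe", "Oceania", "Oceania"),
  student_ratio_region = c(37.3, 37.3, 13.6, 13.6, 24.7, 24.7)
)

# Default: factor levels are sorted alphabetically
alphabetical <- levels(factor(toy$region))

# Reordered: levels follow the negated regional average, so the
# region with the highest average comes first
sorted <- levels(fct_reorder(toy$region, -toy$student_ratio_region))

print(alphabetical)
print(sorted)
```

Since ggplot2 draws discrete axes in the order of the factor levels, changing the levels is all it takes to change the order of the boxplots.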

To increase the readability we are going to flip the coordinates (note that we could also switch x and y arguments in the ggplot call—but this does not work for boxplots so we use coord_flip()). It is also a good practice to include the 0 in plots which we can force by adding scale_y_continuous(limits = c(0, 90)).

ggplot(df_sorted, aes(x = region, y = student_ratio)) +
  geom_boxplot() +
  coord_flip() +
  scale_y_continuous(limits = c(0, 90))

💡 Flip the chart in case of long labels to increase readability and to avoid overlapping or rotated labels!

💡 Since version 3.x.x of {ggplot2}, you can also flip the orientation via the orientation argument, e.g. geom_*(orientation = "y").
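A minimal sketch of a horizontal boxplot without coord_flip(), assuming a {ggplot2} version that supports bidirectional geoms (3.3.0 or later, as far as I know). The data frame toy and its values are made up for illustration:

```r
library(ggplot2)

# Made-up ratios for three regions (illustration data, not the UNESCO data)
toy <- data.frame(
  region = rep(c("Africa", "Europe", "Oceania"), each = 4),
  student_ratio = c(32.9, 30.3, 41.0, 25.5,
                    16.1, 18.2, 12.4, 14.9,
                    26.6, 25.8, 20.1, 29.3)
)

# Map the categories to y and the values to x; the orientation is
# stated explicitly here, although it can usually be inferred
p_horizontal <-
  ggplot(toy, aes(x = student_ratio, y = region)) +
  geom_boxplot(orientation = "y")

# One row of box statistics per region
box_stats <- layer_data(p_horizontal)
print(nrow(box_stats))
```

To my understanding, the orientation argument is mainly needed when the direction cannot be guessed from the mapping, for example when both variables are continuous.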

The order of the categories is perfect as it is after flipping the coordinates—the lower the student-teacher ratio, the better.

💎 Let Your Plot Shine—Get Rid of the Default Settings

Let’s spice this plot up! One great thing about {ggplot2} is that it is structured in an adaptive way, allowing you to add further layers to an existing ggplot object. We are going to

  • use a different theme that comes with the {ggplot2} package by calling theme_set(theme_light()) (several themes come along with the {ggplot2} package but if you need more check for example the packages ggthemes or hrbrthemes),
  • change the font and the overall font size by adding the arguments base_size and base_family to theme_light(),
  • flip the axes by adding coord_flip() (as seen before),
  • let the axis start at 0 and reduce the spacing to the plot margin by adding expand = c(0.005, 0.005) as argument to scale_y_continuous(),
  • add some color encoding the continent by adding color = region to the aes argument and picking a palette from the ggsci package,
  • add meaningful labels and remove useless ones by adding labs(x = NULL, y = "title y"),
  • adjust the new theme (e.g. changing some font settings and removing the legend and grid) by adding theme().

💡 You can easily adjust all sizes of the theme by calling theme_xyz(base_size = )—this is very handy if you need the same viz for a different purpose!

💡 Do not use c(0, 0) since the zero tick is in most cases too close to the axis—use something close to zero instead!
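To get a feeling for how little padding expand = c(0.005, 0.005) adds, here is the arithmetic as a small sketch. It assumes the documented meaning of a length-two expand vector (a multiplicative and an additive constant, applied on both sides of the limits); the variable names are made up:

```r
# Axis limits used in this post
limits <- c(0, 90)

# expand = c(mult, add): multiplicative and additive expansion constants
mult <- 0.005
add  <- 0.005

# Padding added on each side of the axis
padding <- mult * diff(limits) + add

# Resulting axis range after expansion
expanded <- c(limits[1] - padding, limits[2] + padding)

print(padding)
print(expanded)
```

With the default expansion for continuous scales (a multiplicative constant of 0.05, as far as I know) the padding would be about ten times larger.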

I am going to save the ggplot call and all these visual adjustments in a gg object that I name g so we can use it for the next plots.

library(ggsci)     ## provides scale_color_uchicago()
library(showtext)
font_add_google("Poppins", "Poppins")
font_add_google("Roboto Mono", "Roboto Mono")
showtext_auto()

theme_set(theme_light(base_size = 18, base_family = "Poppins"))

g <- 
  ggplot(df_sorted, aes(x = region, y = student_ratio, color = region)) +
    coord_flip() +
    scale_y_continuous(limits = c(0, 90), expand = c(0.005, 0.005)) +
    scale_color_uchicago() +
    labs(x = NULL, y = "Student to teacher ratio") +
    theme(
      legend.position = "none",
      axis.title = element_text(size = 16),
      axis.text.x = element_text(family = "Roboto Mono", size = 12),
      panel.grid = element_blank()
    )

(Note that to include these fonts we make use of the showtext package. This package allows for the use of system fonts and is especially simple to use in combination with Google fonts. You need to (a) have the fonts installed on your system, (b) have the package showtext installed, (c) import the fonts by running font_add_google() and (d) preferably turn on the automatic use of its functionality with showtext_auto().)

📊 The Choice of the Chart Type

We can add any geom_ to our ggplot-preset g that fits the data (note that until now it is just an empty plot with pretty axes):

All four chart types let readers explore the range of values but with different detail and focus. The boxplot and the violin plot both summarize the data; they contain a lot of information by visualizing the distribution of the data points in two different ways (see below for an explanation of how to read a boxplot). By contrast, the line plot shows only the range (minimum and maximum of the data) and the dot strip plot shows the raw data with each single observation. However, a line chart is not a good choice here since it does not allow for the identification of single countries. By adding an alpha argument to geom_point(), the dot strip plot is able to highlight the range of student-teacher ratios while providing the raw data:

g + geom_point(size = 3, alpha = 0.15)

Of course, different geoms can also be combined to provide even more information in one plot:

g + 
  geom_boxplot(color = "gray60", outlier.alpha = 0) +
  geom_point(size = 3, alpha = 0.15)

Remove the outliers to avoid overlapping points! You can achieve this via outlier.shape = NA or outlier.alpha = 0.

We are going to stick to points to visualize the countries explicitly instead of aggregating the data into box- or violin plots. To achieve a higher readability, we use another geom, geom_jitter() which scatters the points in a given direction (x and/or y via width and height) to prevent overplotting:

set.seed(2019)

g + geom_jitter(size = 2, alpha = 0.25, width = 0.2)

💡 Set a seed to keep the jittering of the points fixed every time you call geom_jitter() by calling set.seed()—this becomes especially important when we later label some of the points.

💡 You can also set the seed within the geom_jitter() call by setting position = position_jitter(seed). Note that in this case the width and/or height argument needs to be placed inside the position_jitter() as well:

g + geom_jitter(position = position_jitter(seed = 2019, width = 0.2), size = 2, alpha = 0.25)
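A small sketch to check that a seed inside position_jitter() really fixes the point positions. The data frame toy and its values are made up; building the same plot twice should give identical horizontal positions:

```r
library(ggplot2)

# Made-up ratios for two regions (illustration data)
toy <- data.frame(
  region = rep(c("Africa", "Europe"), each = 5),
  student_ratio = c(32.9, 30.3, 41.0, 25.5, 37.2,
                    16.1, 18.2, 12.4, 14.9, 13.6)
)

p <-
  ggplot(toy, aes(x = region, y = student_ratio)) +
  geom_point(position = position_jitter(seed = 2019, width = 0.2, height = 0))

# Build the layer twice and extract the jittered x positions
x_first  <- as.numeric(layer_data(p)$x)
x_second <- as.numeric(layer_data(p)$x)

print(identical(x_first, x_second))
```

The width applies in both directions, so every point stays within 0.2 units of its category position.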

(In the next code chunks, I am going to use the redundant call of set.seed(2019) before creating the plot but do not show it each time.)

💯 More Geoms, More Fun, More Info!

As mentioned in the beginning, my intention was to visualize both the country-level and the continental-level ratios, in addition to the tile map. Until now, we have focused on countries only. We can indicate the continental average by adding a summary statistic via stat_summary() with a different point size than the points of geom_jitter(). Since the average is more important here, I am going to highlight it with a bigger size and zero transparency:

g +
  geom_jitter(size = 2, alpha = 0.25, width = 0.2) +
  stat_summary(fun = mean, geom = "point", size = 5)

Note that we could also use geom_point(aes(x = region, y = student_ratio_region), size = 5) to achieve the same since we already have a regional average in our data.
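To illustrate that stat_summary(fun = mean) simply plots the group means, here is a sketch that compares the values it computes with means calculated by hand. The toy data frame uses four ratios from the table above; its averages therefore differ from the student_ratio_region column, which is based on all countries:

```r
library(ggplot2)

# Four country-level ratios from the table above (toy subset)
toy <- data.frame(
  region = c("Africa", "Africa", "Europe", "Europe"),
  student_ratio = c(32.9, 30.3, 16.1, 18.2)
)

p <-
  ggplot(toy, aes(x = region, y = student_ratio)) +
  stat_summary(fun = mean, geom = "point", size = 5)

# y positions computed by stat_summary(), one per region
plotted <- as.numeric(layer_data(p)$y)

# The same averages calculated by hand
by_hand <- as.numeric(tapply(toy$student_ratio, toy$region, mean))

print(plotted)
print(by_hand)
```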

To relate all these points to a baseline, we add a line indicating the worldwide average:

world_avg <- 
  df_ratios %>%
  summarize(avg = mean(student_ratio, na.rm = TRUE)) %>%
  pull(avg)

g +
  geom_hline(aes(yintercept = world_avg), color = "gray70", size = 0.6) +
  stat_summary(fun = mean, geom = "point", size = 5) +
  geom_jitter(size = 2, alpha = 0.25, width = 0.2)

💡 One could derive the worldwide average also within the geom_hline() call, but I prefer to keep both steps separated.
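For completeness, a sketch of the one-step variant on made-up data (toy and its values are not the UNESCO data). It also shows why na.rm = TRUE matters: a single missing ratio would otherwise turn the whole average into NA.

```r
library(ggplot2)

# Made-up ratios with one missing value, mimicking the NA in the data
toy <- data.frame(student_ratio = c(32.9, 30.3, 30.1, 19.6, NA, 24.2))

# Without na.rm = TRUE the result is NA
avg_with_na <- mean(toy$student_ratio)

# With na.rm = TRUE the missing value is dropped before averaging
world_avg_toy <- mean(toy$student_ratio, na.rm = TRUE)

# One-step variant: derive the average directly inside geom_hline()
p <-
  ggplot(toy) +
  geom_hline(
    yintercept = mean(toy$student_ratio, na.rm = TRUE),
    color = "gray70"
  )

print(avg_with_na)
print(world_avg_toy)
```

Keeping the average in its own object has the advantage that it can be reused later, for example in the segments and text labels.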

We can further highlight that the baseline is the worldwide average ratio rather than a ratio of 0 (or 1?) by adding a line from each continental average to the worldwide average. The result is a combination of a jitter and a lollipop plot:

g +
  geom_segment(
    aes(x = region, xend = region,
        y = world_avg, yend = student_ratio_region),
    size = 0.8
  ) +
  geom_hline(aes(yintercept = world_avg), color = "gray70", size = 0.6) +
  geom_jitter(size = 2, alpha = 0.25, width = 0.2) +
  stat_summary(fun = mean, geom = "point", size = 5)

Check the order of the geoms to prevent any overlapping—here, for example, draw the line after calling geom_segment() to avoid overlapping!

💬 Add Text Boxes to Let The Plot Speak for Itself

Since I don’t want to include legends, I add some text boxes that explain the different point sizes and the baseline level via annotate("text"):

(g_text <- 
  g +
  geom_segment(
    aes(x = region, xend = region,
        y = world_avg, yend = student_ratio_region),
    size = 0.8
  ) +
  geom_hline(aes(yintercept = world_avg), color = "gray70", size = 0.6) +
  stat_summary(fun = mean, geom = "point", size = 5) +
  geom_jitter(size = 2, alpha = 0.25, width = 0.2) +
  annotate(
    "text", x = 6.3, y = 35, family = "Poppins", size = 2.8, color = "gray20", lineheight = .9,
    label = glue::glue("Worldwide average:\n{round(world_avg, 1)} students per teacher")
  ) +
  annotate(
    "text", x = 3.5, y = 10, family = "Poppins", size = 2.8, color = "gray20",
    label = "Continental average"
  ) +
  annotate(
    "text", x = 1.7, y = 11, family = "Poppins", size = 2.8, color = "gray20",
    label = "Countries per continent"
  ) +
  annotate(
    "text", x = 1.9, y = 64, family = "Poppins", size = 2.8, color = "gray20", lineheight = .9,
    label = "The Central African Republic has by far\nthe most students per teacher"
  )
)

💡 Use glue::glue() to combine strings with variables—this way, you can update your plots without copying and pasting values! (Of course, you can also use your good old friend paste0().)
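A quick sketch comparing both approaches. The value assigned to avg_example is made up; in the post the number comes from the data:

```r
library(glue)

# Made-up value standing in for the computed worldwide average
avg_example <- 23.4567

# glue: R code inside curly braces is evaluated and inserted
label_glue <- glue("Worldwide average:\n{round(avg_example, 1)} students per teacher")

# paste0: the same label assembled piece by piece
label_paste <- paste0(
  "Worldwide average:\n", round(avg_example, 1), " students per teacher"
)

print(label_glue)
print(identical(as.character(label_glue), label_paste))
```

Both produce the same two-line label; glue just keeps the template readable when several values are inserted.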

… and add some arrows to match the text to the visual elements by providing start- and endpoints of the arrows when calling geom_curve(). I am going to draw all arrows with one call, but you could also draw them arrow by arrow. This is not that simple, as the absolute position depends on the dimensions of the plot. Start with a good guess based on the coordinates of the text boxes…

arrows <- 
  tibble(
    x1 = c(6.2, 3.5, 1.7, 1.7, 1.9),
    x2 = c(5.6, 4, 1.9, 2.9, 1.1),
    y1 = c(35, 10, 11, 11, 73), 
    y2 = c(world_avg, 19.4, 14.16, 12, 83.4)
  )

g_text + 
  geom_curve(
    data = arrows, aes(x = x1, y = y1, xend = x2, yend = y2),
    arrow = arrow(length = unit(0.07, "inch")), size = 0.4,
    color = "gray20", curvature = -0.3
  )

… and then adjust, adjust, adjust…

arrows <- 
  tibble(
    x1 = c(6.1, 3.62, 1.8, 1.8, 1.8),
    x2 = c(5.6, 4, 2.18, 2.76, 0.9),
    y1 = c(world_avg + 6, 10.5, 9, 9, 77),
    y2 = c(world_avg + 0.1, 18.4, 14.16, 12, 83.45)
  )

(g_arrows <- 
  g_text +
  geom_curve(
    data = arrows, aes(x = x1, y = y1, xend = x2, yend = y2),
    arrow = arrow(length = unit(0.08, "inch")), size = 0.5,
    color = "gray20", curvature = -0.3
  )
)

💡 Since the curvature is the same for all arrows, one can use different x and y distances and directions between the start and end points to vary their shape!

One last thing that bothers me: A student-teacher ratio of 0 does not make much sense—I definitely prefer to start at a ratio of 1!
And—oh my!—we almost forgot to mention and acknowledge the data source 😨 Let’s quickly also add a plot caption:

(g_final <- 
  g_arrows +
  scale_y_continuous(
    limits = c(0, 90), expand = c(0.005, 0.005),
    breaks = c(1, seq(20, 80, by = 20))
  ) +
  labs(caption = "Data: UNESCO Institute for Statistics") +
  theme(plot.caption = element_text(size = 9, color = "gray50"))
)
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

🗺️ Bonus: Add a Tile Map as Legend

To make it easier to match the countries of the second plot, the country-level tile map, to each continent we have visualized with our jitter plot, we can add a geographical “legend”. For this, I encode the region by color instead of the country-level ratios:

(map_regions <- 
  df_sorted %>%
  ggplot(aes(x = x, y = y, fill = region, color = region)) +
    geom_tile(color = "white") +
    scale_y_reverse() +
    ggsci::scale_fill_uchicago(guide = "none") +
    coord_equal() +
    theme(line = element_blank(),
          panel.background = element_rect(fill = "transparent"),
          plot.background = element_rect(fill = "transparent", color = "transparent"),
          panel.border = element_rect(color = "transparent"),
          strip.background = element_rect(color = "gray20"),
          axis.text = element_blank(),
          plot.margin = margin(0, 0, 0, 0)) +
    labs(x = NULL, y = NULL)
)

… and add this map to the existing plot via annotation_custom(ggplotGrob()):

g_final +
  annotation_custom(ggplotGrob(map_regions), xmin = 2.5, xmax = 7.5, ymin = 55, ymax = 85)
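The same inset technique in a self-contained sketch, using the built-in mtcars data instead of the tile map. The plots main and inset as well as the inset coordinates are made up for illustration:

```r
library(ggplot2)

# Main plot: a plain scatter plot
main <-
  ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Inset plot: a small histogram without axes or background
inset <-
  ggplot(mtcars, aes(x = hp)) +
  geom_histogram(bins = 10) +
  theme_void()

# Convert the inset into a grob (graphical object) ...
inset_grob <- ggplotGrob(inset)

# ... and place it in data coordinates of the main plot
combined <-
  main +
  annotation_custom(inset_grob, xmin = 4, xmax = 5.4, ymin = 25, ymax = 34)

print(class(inset_grob))
```

The xmin, xmax, ymin and ymax values refer to the scales of the main plot, so they need some trial and error, just like the arrows.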

🎄 The Final Evolved Visualization

And here it is, our final plot—evolved from a dreary gray boxplot to a self-explanatory, colorful visualization including the raw data and a tile map legend! 🎉

Thanks for reading, I hope you’ve enjoyed it! Here you find more visualizations I’ve contributed to the #TidyTuesday challenges including my full contribution to week 19 of 2019 we have dissected here:

💻 Complete Code for Final Plot

If you want to create the plot on your own or play around with the code, copy and paste these ~60 lines:

## packages
library(tidyverse)
library(ggsci)
library(showtext)

## load fonts
font_add_google("Poppins", "Poppins")
font_add_google("Roboto Mono", "Roboto Mono")
showtext_auto()

## get data
devtools::source_gist("https://gist.github.com/Z3tt/301bb0c7e3565111770121af2bd60c11")

## tile map as legend
map_regions <- 
  df_ratios %>%
  mutate(region = fct_reorder(region, -student_ratio_region)) %>%
  ggplot(aes(x = x, y = y, fill = region, color = region)) +
    geom_tile(color = "white") +
    scale_y_reverse() +
    scale_fill_uchicago(guide = "none") +
    coord_equal() +
    theme_light() +
    theme(
      line = element_blank(),
      panel.background = element_rect(fill = "transparent"),
      plot.background = element_rect(fill = "transparent",
                                     color = "transparent"),
      panel.border = element_rect(color = "transparent"),
      strip.background = element_rect(color = "gray20"),
      axis.text = element_blank(),
      plot.margin = margin(0, 0, 0, 0)
    ) +
    labs(x = NULL, y = NULL)

## calculate worldwide average
world_avg <- 
  df_ratios %>%
  summarize(avg = mean(student_ratio, na.rm = TRUE)) %>%
  pull(avg)

## coordinates for arrows
arrows <- 
  tibble(
    x1 = c(6, 3.65, 1.8, 1.8, 1.8),
    x2 = c(5.6, 4, 2.18, 2.76, 0.9),
    y1 = c(world_avg + 6, 10.5, 9, 9, 77),
    y2 = c(world_avg + 0.1, 18.4, 14.16, 12, 83.42)
  )

## set seed to fix position of jittered points
set.seed(2019)

## final plot
df_ratios %>%
  mutate(region = fct_reorder(region, -student_ratio_region)) %>%
  ggplot(aes(x = region, y = student_ratio, color = region)) +
    geom_segment(
      aes(x = region, xend = region,
          y = world_avg, yend = student_ratio_region),
      size = 0.8
    ) +
    geom_hline(aes(yintercept = world_avg), color = "gray70", size = 0.6) +
    stat_summary(fun = mean, geom = "point", size = 5) +
    geom_jitter(size = 2, alpha = 0.25, width = 0.2) +
    coord_flip() +
    annotate(
      "text", x = 6.3, y = 35, family = "Poppins",
      size = 2.7, color = "gray20",
      label = glue::glue("Worldwide average:\n{round(world_avg, 1)} students per teacher")
    ) +
    annotate(
      "text", x = 3.5, y = 10, family = "Poppins",
      size = 2.7, color = "gray20",
      label = "Continental average"
    ) +
    annotate(
      "text", x = 1.7, y = 11, family = "Poppins",
      size = 2.7, color = "gray20",
      label = "Countries per continent"
    ) +
    annotate(
      "text", x = 1.9, y = 64, family = "Poppins",
      size = 2.7, color = "gray20",
      label = "The Central African Republic has by far\nthe most students per teacher"
    ) +
    geom_curve(
      data = arrows, aes(x = x1, xend = x2,
                         y = y1, yend = y2),
      arrow = arrow(length = unit(0.08, "inch")), size = 0.5,
      color = "gray20", curvature = -0.3
    ) +
    annotation_custom(
      ggplotGrob(map_regions),
      xmin = 2.5, xmax = 7.5, ymin = 55, ymax = 85
    ) +
    scale_y_continuous(
      limits = c(0, 90), expand = c(0.005, 0.005),
      breaks = c(1, seq(20, 80, by = 20))
    ) +
    scale_color_uchicago() +
    labs(
      x = NULL, y = "Student to teacher ratio",
      caption = 'Data: UNESCO Institute for Statistics'
    ) +
    theme_light(base_size = 18, base_family = "Poppins") +
    theme(
      legend.position = "none",
      axis.title = element_text(size = 12),
      axis.text.x = element_text(family = "Roboto Mono", size = 10),
      plot.caption = element_text(size = 9, color = "gray50"),
      panel.grid = element_blank()
    )

📝 Post Scriptum: Mean versus Median

One thing I want to highlight is that the final plot does not contain the same information as the original boxplot. While I have visualized the mean values of each continent and across the globe, the box of a Box-and-Whisker plot represents the 25th, 50th and 75th percentiles of the data (also known as the first, second and third quartiles):

In a Box-and-Whisker plot the box visualizes the upper and lower quartiles, so the box spans the interquartile range (IQR) containing 50 percent of the data, and the median is marked by a vertical line inside the box.
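The numbers behind such a box can be sketched in base R. The vector ratios is made up for illustration, and quantile() is used with its default method:

```r
# Made-up student-teacher ratios (illustration data)
ratios <- c(10, 12, 14, 16, 18, 20, 22, 24, 83)

# First, second and third quartile (25th, 50th and 75th percentile)
quartiles <- quantile(ratios, probs = c(0.25, 0.5, 0.75))

# Width of the box: the interquartile range
box_width <- IQR(ratios)

print(quartiles)
print(box_width)
```

Half of the values fall between the first and the third quartile, no matter how extreme the largest value is.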

The 2nd quartile is known as the median, i.e. 50% of the data points fall below this value and the other 50% are higher than this value. My decision to estimate the mean value was based on the fact that my aim was a visualization that is easily understandable to a large (non-scientific) audience that is used to mean (“average”) values but not to median estimates. However, in case of skewed data, the mean value of a dataset is also biased towards higher or lower values. Let’s compare a plot based on the mean with one based on the median:

As one can see, the differences between continents stay roughly the same but the worldwide median is lower than the worldwide average (19.6 students per teacher versus 23.5). The plot with medians highlights that the median student-teacher ratios of Asia and Oceania are similar to the worldwide median. This plot now resembles the basic boxplot we used in the beginning much more, but it may be harder to interpret for some readers than the one visualizing average ratios.
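Why the mean ends up above the median for right-skewed data can be seen in a tiny sketch with made-up numbers (skewed is not taken from the UNESCO data):

```r
# Made-up ratios with one extreme value, i.e. right-skewed data
skewed <- c(10, 12, 14, 16, 83)

# The mean is pulled towards the extreme value ...
avg <- mean(skewed)

# ... while the median stays in the middle of the bulk of the data
med <- median(skewed)

print(avg)
print(med)
```

A single very high ratio is enough to move the mean well above the median, which is the pattern we see in the worldwide values above.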