Data Visualization & Labeling in R

Tutorial

Author

Solomon Eshun

Published

October 12, 2024

Data visualization plays a crucial role in uncovering insights and communicating findings effectively. Not all data points are created equal: some are outliers or otherwise significant values that deserve special attention. As a result, it’s often important to highlight the specific data points that meet certain conditions.

In this post, I’ll demonstrate how to tag or label specific data points in your visualizations in R. Using the tidyverse (in particular ggplot2) together with the ggrepel package, we’ll create plots that highlight important data points while keeping the visualization clean and readable.

Let’s start by creating a simple dataset with a few variables. We’ll simulate a dataset of 50 individuals with the following features:

Temperature: Body temperature (°C), normally distributed with a mean of 37°C and a standard deviation of 1°C.
Weight: Weight (kg), normally distributed with a mean of 70 kg and a standard deviation of 10 kg.
Height: Height (cm), normally distributed with a mean of 170 cm and a standard deviation of 15 cm.
Exercise: Hours of exercise per week, drawn from a Poisson distribution with a mean of 3.

suppressMessages(library(tidyverse))

set.seed(101224)

n <- 50

dummy_names <- c("John", "Anna", "Mike", "Sara", "Paul", "Emma", "Leo", "Kate", "Mark","Lisa", 
                 "Tom", "Nina", "Dan", "Eva", "Ben", "Ivy", "Alex", "Rita", "Joe", "Maya", 
                 "Sam", "Lara", "Nick", "Zoe", "Luke","Ella", "Max", "Amy", "Rob", "Tina", 
                 "Sean", "Beth", "Jay", "Lily", "Omar", "Jill", "Roy", "Cara", "Ian", "Mona", 
                 "Fred", "Faye", "Hugo", "Tara", "Owen", "Clara", "Ray", "Nora", "Don", "Lila")

data <- tibble(
  Record_id = dummy_names,
  Temperature = rnorm(n, mean = 37, sd = 1),
  Weight = rnorm(n, mean = 70, sd = 10),
  Height = rnorm(n, mean = 170, sd = 15),
  Exercise = rpois(n, lambda = 3)
)

head(data)

Record_id	Temperature	Weight	Height	Exercise
John	36.86387	71.30932	171.3720	4
Anna	36.35553	73.98145	185.0042	2
Mike	38.00372	77.16113	162.5674	0
Sara	35.94630	90.78671	169.2629	5
Paul	37.49920	67.45276	176.5564	1
Emma	35.54057	71.41362	173.4389	1

Now, let’s establish the criteria for tagging points that we may later want to exclude from the dataset. Rather than dropping rows right away, we flag them in a new logical column called remove. A participant is flagged if they meet at least one of the following conditions:

Temperatures above 38°C.
Weights above 100 kg.
Heights below 150 cm or above 190 cm (extremes).

data <- data %>%
  mutate(remove = case_when(
    Temperature > 38 ~ TRUE,       
    Weight > 100 ~ TRUE,            
    Height < 150 | Height > 190 ~ TRUE,
    TRUE ~ FALSE                   
  ))

head(data)

Record_id	Temperature	Weight	Height	Exercise	remove
John	36.86387	71.30932	171.3720	4	FALSE
Anna	36.35553	73.98145	185.0042	2	FALSE
Mike	38.00372	77.16113	162.5674	0	TRUE
Sara	35.94630	90.78671	169.2629	5	FALSE
Paul	37.49920	67.45276	176.5564	1	FALSE
Emma	35.54057	71.41362	173.4389	1	FALSE
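
A single combined flag does not tell us which criterion a participant met. One option, sketched below, is to keep a separate logical column per criterion and derive the combined flag from them. The tibble `toy` holds made-up values, and the column names `high_temp`, `high_weight`, and `extreme_height` are hypothetical names chosen for this illustration; they are not part of the dataset above.

```r
library(dplyr)
library(tibble)

# Toy data for illustration only (values are made up)
toy <- tibble(
  Record_id   = c("A", "B", "C", "D"),
  Temperature = c(36.8, 38.4, 37.1, 36.5),
  Weight      = c(71, 68, 104, 80),
  Height      = c(171, 165, 180, 148)
)

flagged <- toy %>%
  mutate(
    high_temp      = Temperature > 38,
    high_weight    = Weight > 100,
    extreme_height = Height < 150 | Height > 190,
    # Combined flag: TRUE if any single criterion is met
    remove         = high_temp | high_weight | extreme_height
  )

# Number of participants meeting each criterion, plus the combined total
flag_counts <- flagged %>%
  summarise(across(c(high_temp, high_weight, extreme_height, remove), sum))

flag_counts
```

Keeping the per-criterion columns makes it easy to check later why a given record was tagged, at the cost of a few extra columns.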

Now that we have tagged the data points that meet our conditions, we can visualize them. We first show every participant who meets any of the tagging conditions: a temperature above 38°C, a weight above 100 kg, or a height outside the 150 to 190 cm range. We’ll create a scatter plot of weight vs. height and use ggrepel to label those participants. ggrepel pushes labels away from one another and from the data points in the labeling layer, which helps keep the visualization clean and readable. Note that a participant tagged only for temperature is also labeled here, even though temperature is not shown on either axis.

ggplot(data, aes(x = Weight, y = Height)) +
  geom_point() +
  ggrepel::geom_label_repel(
    data = filter(data, remove == TRUE),
    aes(label = Record_id),
    size = 3,
    nudge_x = 0.15,
    nudge_y = 0.15,
    color = "#c1121f"
  ) +
  labs(title = " ",
       x = "Weight (kg)", 
       y = "Height (cm)") +
  theme_classic() +
  theme(
    axis.title.x = element_text(size = 15, margin = margin(t = 15), face = "bold"),
    axis.title.y = element_text(size = 15, margin = margin(r = 15), face = "bold"),
    axis.text = element_text(size = 14),
    plot.title = element_text(hjust = 0.5, size = 18, margin = margin(b = 10)),
    panel.border = element_blank(),
    axis.line = element_line(color = "grey", linewidth = 0.5),
    strip.text = element_text(size = 16)
  )
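
Because the label layer above only receives the flagged rows, ggrepel has no knowledge of the unlabeled points and labels may still land on top of them. A possible variation, sketched here on made-up data, passes every row to the label layer and gives unflagged rows an empty label; ggrepel does not draw empty labels, but their points still repel the labels that are drawn. The tibble `toy` and the helper column `label` are assumptions for this example only.

```r
library(ggplot2)
library(dplyr)
library(tibble)

# Toy data for illustration only (values are made up)
toy <- tibble(
  Record_id = c("A", "B", "C", "D", "E"),
  Weight    = c(71, 68, 104, 80, 75),
  Height    = c(171, 165, 180, 148, 172),
  remove    = c(FALSE, FALSE, TRUE, TRUE, FALSE)
)

# Blank label for unflagged rows, so every point is still passed to ggrepel
toy <- toy %>%
  mutate(label = if_else(remove, Record_id, ""))

p <- ggplot(toy, aes(x = Weight, y = Height)) +
  # Color flagged points to match their labels
  geom_point(aes(color = remove)) +
  scale_color_manual(
    values = c(`FALSE` = "grey40", `TRUE` = "#c1121f"),
    guide = "none"
  ) +
  ggrepel::geom_label_repel(aes(label = label), size = 3, color = "#c1121f") +
  labs(x = "Weight (kg)", y = "Height (cm)") +
  theme_classic()

p
```

With only a handful of points the difference is small, but on dense scatter plots this tends to keep labels off neighboring points.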

Next, we’ll create a faceted plot where each panel shows one of the three variables: temperature, weight, or height. Participants who meet the specific tagging criteria for that variable will be labeled in the plot. This is useful to examine the distribution of tags for each variable separately.

# The combined `remove` flag is not needed here, so keep only the ID and the
# three variables that have tagging criteria, then reshape to long format
data_long <- data %>%
  select(Record_id, Temperature, Weight, Height) %>%
  pivot_longer(cols = -Record_id, names_to = "variable", values_to = "value")

# Define the filtering conditions for tagging the ids based on the specific variable
data_long <- data_long %>%
  mutate(tag = case_when(
    variable == "Temperature" & value > 38 ~ TRUE,
    variable == "Weight" & value > 100 ~ TRUE,
    variable == "Height" & (value < 150 | value > 190) ~ TRUE,
    TRUE ~ FALSE
  ))
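
Before plotting, it can help to check how many records were tagged under each variable. The following is a minimal sketch on a made-up long-format tibble (`toy_long`) with the same `variable`, `value`, and `tag` columns as above; the summary column names `n_tagged` and `ids` are hypothetical.

```r
library(dplyr)
library(tibble)

# Toy long-format data for illustration only (values are made up)
toy_long <- tibble(
  Record_id = rep(c("A", "B", "C"), times = 3),
  variable  = rep(c("Temperature", "Weight", "Height"), each = 3),
  value     = c(36.8, 38.4, 37.1, 71, 68, 104, 171, 148, 195),
  tag       = c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

# Count tagged records per variable and list their IDs
tag_summary <- toy_long %>%
  filter(tag) %>%
  group_by(variable) %>%
  summarise(
    n_tagged = n(),
    ids = paste(Record_id, collapse = ", "),
    .groups = "drop"
  )

tag_summary
```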

# Plot for each variable, tagging ids based on specific conditions
ggplot(data_long, aes(x = Record_id, y = value)) +
  geom_point() +
  facet_wrap(~ variable, scales = "free_y") +
  ggrepel::geom_label_repel(
    data = filter(data_long, tag == TRUE),
    aes(label = Record_id),
    size = 3,
    nudge_x = 0.15,
    nudge_y = 0.15,
    color = "#c1121f"
  ) +
  labs(title = " ",
       x = "ID", 
       y = "Value") +
  theme_classic() +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.y = element_text(size = 15, margin = margin(r = 15), face = "bold"),
    axis.text = element_text(size = 14),
    plot.title = element_text(hjust = 0.5, size = 18, margin = margin(b = 10)),
    panel.border = element_blank(),
    axis.line = element_line(color = "grey", linewidth = 0.5),
    strip.text = element_text(size = 16, face = "bold")
  )
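
To make the cutoffs themselves visible in each panel, one possible extension is to draw them as dashed reference lines. The sketch below assumes a small lookup table, here called `thresholds` (a hypothetical name), whose `variable` column matches the facet names; `toy_long` again holds made-up values.

```r
library(ggplot2)
library(tibble)

# Toy long-format data for illustration only (values are made up)
toy_long <- tibble(
  Record_id = rep(c("A", "B", "C"), times = 3),
  variable  = rep(c("Temperature", "Weight", "Height"), each = 3),
  value     = c(36.8, 38.4, 37.1, 71, 68, 104, 171, 148, 195)
)

# One row per cutoff; `variable` must match the facet names
thresholds <- tribble(
  ~variable,     ~cutoff,
  "Temperature",  38,
  "Weight",      100,
  "Height",      150,
  "Height",      190
)

p <- ggplot(toy_long, aes(x = Record_id, y = value)) +
  geom_point() +
  # Each cutoff is drawn only in the panel whose `variable` it matches
  geom_hline(
    data = thresholds,
    aes(yintercept = cutoff),
    linetype = "dashed",
    color = "#c1121f"
  ) +
  facet_wrap(~ variable, scales = "free_y") +
  theme_classic()

p
```

Because the lines come from a data frame rather than hard-coded values, changing a cutoff only requires editing one row of the lookup table.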

I hope you enjoyed it. Feel free to leave a comment below!