Data Visualization & Labeling in R

Categories: R, Tutorial

Author: Solomon Eshun

Published: October 12, 2024

Data visualization plays a crucial role in uncovering insights and communicating findings effectively. Not all data points are equally informative: some are outliers or otherwise significant values that deserve special attention, so it is often useful to highlight the points that meet certain conditions.

In this post, I’ll demonstrate how to tag or label specific data points in your visualizations using tools from the tidyverse. With ggplot2 and the ggrepel extension package (which is not part of the tidyverse and must be installed separately), we’ll create plots that highlight important data points while keeping the visualization clean and readable.

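Let’s start by creating a simple dataset. We’ll simulate 50 individuals with the following features: a name used as the record ID (Record_id), body temperature in °C (Temperature), weight in kg (Weight), height in cm (Height), and an exercise count drawn from a Poisson distribution with mean 3 (Exercise).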

suppressMessages(library(tidyverse))

set.seed(101224)

n <- 50

dummy_names <- c("John", "Anna", "Mike", "Sara", "Paul", "Emma", "Leo", "Kate", "Mark","Lisa", 
                 "Tom", "Nina", "Dan", "Eva", "Ben", "Ivy", "Alex", "Rita", "Joe", "Maya", 
                 "Sam", "Lara", "Nick", "Zoe", "Luke","Ella", "Max", "Amy", "Rob", "Tina", 
                 "Sean", "Beth", "Jay", "Lily", "Omar", "Jill", "Roy", "Cara", "Ian", "Mona", 
                 "Fred", "Faye", "Hugo", "Tara", "Owen", "Clara", "Ray", "Nora", "Don", "Lila")

data <- tibble(
  Record_id = dummy_names,
  Temperature = rnorm(n, mean = 37, sd = 1),
  Weight = rnorm(n, mean = 70, sd = 10),
  Height = rnorm(n, mean = 170, sd = 15),
  Exercise = rpois(n, lambda = 3)
)

head(data)
Record_id  Temperature    Weight    Height  Exercise
John          36.86387  71.30932  171.3720         4
Anna          36.35553  73.98145  185.0042         2
Mike          38.00372  77.16113  162.5674         0
Sara          35.94630  90.78671  169.2629         5
Paul          37.49920  67.45276  176.5564         1
Emma          35.54057  71.41362  173.4389         1
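Because the data are simulated, the values printed above depend on the seed. As a minimal base-R sketch (the names a and b are only for illustration and are not part of the original code), resetting the same seed before a draw reproduces that draw exactly:

```r
# Draw five temperatures, reset the seed, and draw again
set.seed(101224)
a <- rnorm(5, mean = 37, sd = 1)

set.seed(101224)
b <- rnorm(5, mean = 37, sd = 1)

# The two draws are the same because the generator started from the same state
identical(a, b)
```

Note that the seed fixes the whole sequence of random numbers, so adding or reordering the rnorm() and rpois() calls in the tibble would change the simulated values.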

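Now, let’s establish the criteria for tagging points that we may want to exclude from the dataset. We flag any participant who meets at least one of the following conditions: a temperature above 38°C, a weight above 100 kg, or a height below 150 cm or above 190 cm. The flag is stored in a logical column called remove.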

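data <- data %>%
  mutate(remove = case_when(
    Temperature > 38 ~ TRUE,             # temperature above 38°C
    Weight > 100 ~ TRUE,                 # weight above 100 kg
    Height < 150 | Height > 190 ~ TRUE,  # height outside 150 to 190 cm
    TRUE ~ FALSE                         # everyone else is kept
  ))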

head(data)
Record_id  Temperature    Weight    Height  Exercise  remove
John          36.86387  71.30932  171.3720         4   FALSE
Anna          36.35553  73.98145  185.0042         2   FALSE
Mike          38.00372  77.16113  162.5674         0    TRUE
Sara          35.94630  90.78671  169.2629         5   FALSE
Paul          37.49920  67.45276  176.5564         1   FALSE
Emma          35.54057  71.41362  173.4389         1   FALSE
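It can also help to record which rule triggered the flag. The following is a minimal sketch on a small made-up dataset: the toy tibble, its values, and the reason column are assumptions for illustration, not part of the original analysis. Because case_when() returns the first matching branch, a participant who breaks several rules is only assigned the first one.

```r
library(dplyr)

# Hypothetical four-person dataset (values invented for illustration)
toy <- tibble(
  Record_id   = c("A", "B", "C", "D"),
  Temperature = c(36.5, 38.4, 37.0, 37.2),
  Weight      = c(70, 82, 104, 65),
  Height      = c(172, 168, 181, 147)
)

toy <- toy %>%
  mutate(
    # Same thresholds as in the post, written as a single logical expression
    remove = Temperature > 38 | Weight > 100 | Height < 150 | Height > 190,
    # First rule that fires, or NA when no rule fires
    reason = case_when(
      Temperature > 38 ~ "Temperature",
      Weight > 100 ~ "Weight",
      Height < 150 | Height > 190 ~ "Height",
      TRUE ~ NA_character_
    )
  )

# How many participants are flagged, and by which rule
count(toy, remove, reason)
```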

Now that we have tagged specific data points based on our conditions, we can visualize them. We first show all participants who meet any of the tagging conditions: a temperature above 38°C, a weight above 100 kg, or a height outside the 150 to 190 cm range. We’ll create a scatter plot of height against weight and use ggrepel to label the participants who meet at least one of these conditions. ggrepel pushes labels away from each other and from the data points, which keeps the visualization clean and readable. Note that a labeled point can sit in the middle of the cloud: a participant flagged only for temperature still gets a label, even though temperature is not on either axis.

ggplot(data, aes(x = Weight, y = Height)) +
  geom_point() +
  # Label only the flagged participants; ggrepel keeps the labels from overlapping
  ggrepel::geom_label_repel(
    data = filter(data, remove),
    aes(label = Record_id),
    size = 3,
    nudge_x = 0.15,
    nudge_y = 0.15,
    color = "#c1121f"
  ) +
  labs(x = "Weight (kg)",
       y = "Height (cm)") +
  theme_classic() +
  theme(
    axis.title.x = element_text(size = 15, margin = margin(t = 15), face = "bold"),
    axis.title.y = element_text(size = 15, margin = margin(r = 15), face = "bold"),
    axis.text = element_text(size = 14),
    panel.border = element_blank(),
    axis.line = element_line(color = "grey", linewidth = 0.5)
  )
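A possible variation is to color the flagged points as well as labeling them, so they stay visible even where a label has been pushed away from its point. This is only a sketch on a made-up dataset: the toy tibble and the grey color are assumptions, and max.overlaps = Inf is an optional ggrepel setting that asks it to draw every label rather than dropping crowded ones.

```r
library(dplyr)
library(ggplot2)
library(ggrepel)

# Hypothetical data (values invented for illustration)
toy <- tibble(
  Record_id   = c("A", "B", "C", "D"),
  Temperature = c(36.5, 38.4, 37.0, 37.2),
  Weight      = c(70, 82, 104, 65),
  Height      = c(172, 168, 181, 147)
) %>%
  mutate(remove = Temperature > 38 | Weight > 100 | Height < 150 | Height > 190)

p <- ggplot(toy, aes(x = Weight, y = Height)) +
  # Flagged points in red, all others in grey
  geom_point(aes(color = remove), size = 2) +
  scale_color_manual(values = c(`FALSE` = "grey40", `TRUE` = "#c1121f"),
                     guide = "none") +
  # Labels for the flagged participants only
  geom_label_repel(
    data = filter(toy, remove),
    aes(label = Record_id),
    size = 3,
    color = "#c1121f",
    max.overlaps = Inf
  ) +
  labs(x = "Weight (kg)", y = "Height (cm)") +
  theme_classic()

p
```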

Next, we’ll create a faceted plot where each panel shows one of the three variables: temperature, weight, and height. Within each panel, only the participants who meet the tagging criterion for that variable are labeled, which makes it easy to see which rule flagged each participant.

# Reshape to long format, keeping only the three screened variables.
# Exercise and the overall remove flag are dropped so that the plot has exactly three panels.
data_long <- data %>%
  select(Record_id, Temperature, Weight, Height) %>%
  pivot_longer(cols = -Record_id, names_to = "variable", values_to = "value")

# Define the filtering conditions for tagging the ids based on the specific variable
data_long <- data_long %>%
  mutate(tag = case_when(
    variable == "Temperature" & value > 38 ~ TRUE,
    variable == "Weight" & value > 100 ~ TRUE,
    variable == "Height" & (value < 150 | value > 190) ~ TRUE,
    TRUE ~ FALSE
  ))
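If the number of variables grows, one alternative to a long case_when() is to keep the thresholds in a small lookup table and join it onto the long data. The sketch below assumes a made-up dataset and a hypothetical limits table with lower and upper columns; none of these names come from the original code.

```r
library(dplyr)
library(tidyr)

# Hypothetical data (values invented for illustration)
toy <- tibble(
  Record_id   = c("A", "B", "C", "D"),
  Temperature = c(36.5, 38.4, 37.0, 37.2),
  Weight      = c(70, 82, 104, 65),
  Height      = c(172, 168, 181, 147)
)

# Hypothetical lookup table: one row per variable, -Inf where there is no lower bound
limits <- tibble(
  variable = c("Temperature", "Weight", "Height"),
  lower    = c(-Inf, -Inf, 150),
  upper    = c(38, 100, 190)
)

toy_long <- toy %>%
  pivot_longer(cols = -Record_id, names_to = "variable", values_to = "value") %>%
  # Attach each variable's bounds, then tag values that fall outside them
  left_join(limits, by = "variable") %>%
  mutate(tag = value < lower | value > upper)

filter(toy_long, tag)
```

With this layout, changing a threshold or adding a variable means editing one row of the table rather than another branch of the condition.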

# Plot for each variable, tagging ids based on specific conditions
ggplot(data_long, aes(x = Record_id, y = value)) +
  geom_point() +
  # One panel per variable, each with its own y-axis scale
  facet_wrap(~ variable, scales = "free_y") +
  ggrepel::geom_label_repel(
    data = filter(data_long, tag),
    aes(label = Record_id),
    size = 3,
    nudge_x = 0.15,
    nudge_y = 0.15,
    color = "#c1121f"
  ) +
  labs(y = "Value") +
  theme_classic() +
  theme(
    # The x-axis only spreads the ids out, so its title, text, and ticks are hidden
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.y = element_text(size = 15, margin = margin(r = 15), face = "bold"),
    axis.text = element_text(size = 14),
    panel.border = element_blank(),
    axis.line = element_line(color = "grey", linewidth = 0.5),
    strip.text = element_text(size = 16, face = "bold")
  )
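To make the tagging rules visible in the plot itself, one option is to draw each variable's cutoffs as dashed lines in its own panel. This is a sketch under assumptions: the toy data and the cutoffs table are made up for illustration. Because the cutoffs table has a variable column, facet_wrap() places each line only in the matching panel.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical data in long format (values invented for illustration)
toy_long <- tibble(
  Record_id   = c("A", "B", "C", "D"),
  Temperature = c(36.5, 38.4, 37.0, 37.2),
  Weight      = c(70, 82, 104, 65),
  Height      = c(172, 168, 181, 147)
) %>%
  pivot_longer(cols = -Record_id, names_to = "variable", values_to = "value")

# Hypothetical cutoff table: Height has two rows, one per bound
cutoffs <- tibble(
  variable = c("Temperature", "Weight", "Height", "Height"),
  cutoff   = c(38, 100, 150, 190)
)

p <- ggplot(toy_long, aes(x = Record_id, y = value)) +
  geom_point() +
  # Dashed reference line at each threshold, drawn in the matching panel only
  geom_hline(data = cutoffs, aes(yintercept = cutoff),
             linetype = "dashed", color = "#c1121f") +
  facet_wrap(~ variable, scales = "free_y") +
  labs(y = "Value") +
  theme_classic()

p
```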

I hope you enjoyed it. Feel free to leave a comment below!