Plot table with three categorical variables and one numerical variable using geom_smooth

Question

I have this table with three categorical variables and one numerical variable:

df <- structure(list(Q = c("q_pol", "q_wh", "q_pol", "q_wh"), 
                     median_all = c(0.667362125626559, 0.624735641188929, 0.548153075210995, 0.398574206026083), 
                     median_half = c(-0.350314785114947,1.42461790732669, 0.372537880024059, 0.44085155122463), 
                     median_third = c(-0.93389146143506,0.236025246988988, -1.02912771930043, 0.0361894830862238), 
                     median_quart = c(-0.112157689065904,  0.704777764871505, -0.848709176683769, 1.24452019211073), 
                     Partcpt = c("Not_Answerer", "Not_Answerer", "Answerer", "Answerer")), 
                class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L))

I want to visualize how the values in the median* columns distribute over the three categorical variables using geom_smooth. To get there I've been doing this:

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
df %>%
  # cast all `median*`variables longer:
  pivot_longer(-c(Q, Partcpt)) %>%
  # rename:
  rename(slope_range = name) %>%
  # simplify labels:
  mutate(slope_range = str_replace(slope_range, ".*_(.*)$", "\\1")) %>%
  # convert slope_range to numerical variable:
  mutate(slope_range_N = case_when(
    slope_range == "all" ~ 1,
    slope_range == "half" ~ 2,
    slope_range == "third" ~ 3,
    TRUE                    ~ 4)
  ) %>%
  # plot:
  ggplot(aes(x = slope_range_N, y = value, color = Q)) +
  geom_smooth(method = "loess")

Two problems here: first, the manual conversion of slope_range to a numeric variable feels clumsy; second, and more importantly, the resulting plot does not show how value varies by Partcpt. How can Partcpt be included as the fourth variable in the plot?

EDIT:

Maybe the following goes some way toward a solution, the basic idea being that the Q values and the Partcpt values are cast into a single column (rather than two different ones):

# df1 = the reshaped long table from the pipeline above, up to (not including) the ggplot() call
# df with `Q`:
df_Q <- df1 %>% 
  select(Q, slope_range,  value, slope_range_N) %>%
  rename(Cat = Q)

# df with `Partcpt`
df_Partcpt <- df1 %>% 
  select(Partcpt, slope_range,  value, slope_range_N) %>%
  rename(Cat = Partcpt)

# bind:
plot_df <- bind_rows(df_Q, df_Partcpt)

# plot:
ggplot(plot_df,
       aes(x = slope_range_N, y = value, color = Cat)) + 
  geom_smooth(method = "loess", span = 0.4, se = FALSE)

What I don't know is how to get just two colors for the two Q values and two line types for the two Partcpt values.

camille · Accepted Answer

This is partly just an improvement on the data wrangling; it looks like you overengineered the process. When you reshape the data, you can convert the slope variable's factor levels to numbers right inside pivot_longer. Then map Partcpt to the linetype. One thing to note is that you only have one observation per slope-participant-Q combination, so a LOESS model isn't really appropriate with so few observations (you'll get a wall of warnings about this). You could use a spline instead if you wanted.

library(dplyr)
library(tidyr)
library(ggplot2)

df_long <- df %>%
  pivot_longer(c(-Q, -Partcpt), names_to = "slope_range", 
               names_transform = list(slope_range = ~as.numeric(as.factor(.))))

ggplot(df_long, aes(x = slope_range, y = value, color = Q, linetype = Partcpt)) +
  geom_smooth(method = loess, se = FALSE) +
  guides(linetype = guide_legend(override.aes = list(color = "black")))
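
One caveat with the reshaping above: as.factor() sorts character levels alphabetically, so median_quart is coded as 3 and median_third as 4, which differs from the all, half, third, quart coding used in the question. If that order matters, a sketch along these lines spells the levels out explicitly. The small df here is a rounded stand-in for the question's data, and slope_levels, slope_range_N and df_ordered are illustrative names, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Rounded stand-in for the question's `df` (assumption: same columns, fewer digits).
df <- data.frame(
  Q = c("q_pol", "q_wh", "q_pol", "q_wh"),
  median_all = c(0.667, 0.625, 0.548, 0.399),
  median_half = c(-0.350, 1.425, 0.373, 0.441),
  median_third = c(-0.934, 0.236, -1.029, 0.036),
  median_quart = c(-0.112, 0.705, -0.849, 1.245),
  Partcpt = c("Not_Answerer", "Not_Answerer", "Answerer", "Answerer")
)

# Assumed intended order of the slope ranges: all < half < third < quart.
slope_levels <- c("all", "half", "third", "quart")

df_ordered <- df %>%
  # reshape the median_* columns and drop the "median_" prefix from the names
  pivot_longer(
    starts_with("median_"),
    names_to = "slope_range",
    names_prefix = "median_"
  ) %>%
  # explicit levels, so the numeric code follows slope_levels, not the alphabet
  mutate(slope_range_N = as.integer(factor(slope_range, levels = slope_levels)))
```

With this, slope_range_N can be used as the x aesthetic in place of the alphabetically coded slope_range.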

You can pass arguments to loess if you need to. For something like this, though, I usually prefer to do the modeling myself. I think this is similar to what geom_smooth does under the hood, but it can be useful to have direct access to the models. Here I'll fit a LOESS model for each Q-participant combination, use those models to predict values at a fine grid of points along the domain of slope_range, and then draw the predictions with geom_line directly. The default line width differs between geom_smooth and geom_line, but that is easy to adjust.



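df_modeled <- df_long %>%
  group_by(Q, Partcpt) %>%
  nest() %>%
  # one LOESS fit per Q-Partcpt group
  mutate(loess_mod = purrr::map(data, ~loess(value ~ slope_range, data = .)),
         # grid of x values spanning each group's slope_range
         x = purrr::map(data, ~seq(min(.$slope_range), max(.$slope_range), by = 0.1)),
         # predictions from each group's model on its own grid
         pred = purrr::map2(loess_mod, x, ~predict(.x, newdata = .y))) %>%
  # current tidyr expects the columns to unnest wrapped in c()
  unnest(c(x, pred))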

head(df_modeled)
#> # A tibble: 6 × 6
#> # Groups:   Q, Partcpt [1]
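#>   Q     Partcpt      data             loess_mod     x   pred
#>   <chr> <chr>        <list>           <list>    <dbl>  <dbl>
#> 1 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1   0.667 
#> 2 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.1 0.598 
#> 3 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.2 0.496 
#> 4 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.3 0.373 
#> 5 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.4 0.236 
#> 6 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.5 0.0949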

ggplot(df_modeled, aes(x = x, y = pred, color = Q, linetype = Partcpt)) +
  geom_line() +
  guides(linetype = guide_legend(override.aes = list(color = "black")))
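
The spline alternative mentioned at the start of this answer could look roughly like the following. This is only a sketch: it swaps LOESS for an interpolating cubic spline via stats::spline(), which passes exactly through the four points of each group rather than smoothing them, and it assumes dplyr 1.1.0 or later for reframe(). The rounded df and the name df_spline are illustrative.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Rounded stand-in for the question's `df` (assumption: same columns, fewer digits).
df <- data.frame(
  Q = c("q_pol", "q_wh", "q_pol", "q_wh"),
  median_all = c(0.667, 0.625, 0.548, 0.399),
  median_half = c(-0.350, 1.425, 0.373, 0.441),
  median_third = c(-0.934, 0.236, -1.029, 0.036),
  median_quart = c(-0.112, 0.705, -0.849, 1.245),
  Partcpt = c("Not_Answerer", "Not_Answerer", "Answerer", "Answerer")
)

# Same reshaping as in the answer above.
df_long <- df %>%
  pivot_longer(c(-Q, -Partcpt), names_to = "slope_range",
               names_transform = list(slope_range = ~as.numeric(as.factor(.))))

# Interpolate each Q-Partcpt group at 50 points; spline() returns list(x, y).
df_spline <- df_long %>%
  group_by(Q, Partcpt) %>%
  reframe(as.data.frame(stats::spline(slope_range, value, n = 50)))

ggplot(df_spline, aes(x = x, y = y, color = Q, linetype = Partcpt)) +
  geom_line() +
  guides(linetype = guide_legend(override.aes = list(color = "black")))
```

Because there is only one observation per x value in each group, an interpolating curve like this mainly connects the points smoothly; it should not be read as an estimate of a trend.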
