Chris Ruehlemann

Reputation: 21410

Plot table with three categorical variables and one numerical variable using geom_smooth

I have this table with three categorical variables and one numerical variable:

df <- structure(list(Q = c("q_pol", "q_wh", "q_pol", "q_wh"), 
                     median_all = c(0.667362125626559, 0.624735641188929, 0.548153075210995, 0.398574206026083), 
                     median_half = c(-0.350314785114947,1.42461790732669, 0.372537880024059, 0.44085155122463), 
                     median_third = c(-0.93389146143506,0.236025246988988, -1.02912771930043, 0.0361894830862238), 
                     median_quart = c(-0.112157689065904,  0.704777764871505, -0.848709176683769, 1.24452019211073), 
                     Partcpt = c("Not_Answerer", "Not_Answerer", "Answerer", "Answerer")), 
                class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L)) 

I want to visualize how the values in the median* columns are distributed across the three categorical variables using geom_smooth. To get there, I've been doing this:

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
df %>%
  # cast all `median*`variables longer:
  pivot_longer(-c(Q, Partcpt)) %>%
  # rename:
  rename(slope_range = name) %>%
  # simplify labels:
  mutate(slope_range = str_replace(slope_range, ".*_(.*)$", "\\1")) %>%
  # convert slope_range to numerical variable:
  mutate(slope_range_N = case_when(
    slope_range == "all" ~ 1,
    slope_range == "half" ~ 2,
    slope_range == "third" ~ 3,
    TRUE                    ~ 4)
  ) %>%
  # plot:
  ggplot(
    aes(x = slope_range_N, y = value, color = Q)) +
  geom_smooth(method = "loess")

Two problems here. First, the conversion of slope_range to numeric feels clumsy. Second, and more importantly, the resulting plot does not show the distribution of value by Partcpt. How can Partcpt be included as the fourth variable in the plot?


EDIT:

Maybe the following goes some way toward a solution. The basic idea is that the Q values and the Partcpt values are cast into a single column (rather than two different ones):

# df with `Q` (`df1` is assumed to be the long-format data from the pipeline above, before the ggplot() call):
df_Q <- df1 %>% 
  select(Q, slope_range,  value, slope_range_N) %>%
  rename(Cat = Q)

# df with `Partcpt`
df_Partcpt <- df1 %>% 
  select(Partcpt, slope_range,  value, slope_range_N) %>%
  rename(Cat = Partcpt)

# bind:
plot_df <- bind_rows(df_Q, df_Partcpt)

# plot:
ggplot(plot_df,
       aes(x = slope_range_N, y = value, color = Cat)) + 
  geom_smooth(method = "loess", span = 0.4, se = FALSE)

What I don't know is how to get just two colors for the two Q values and two line types for the two Partcpt values.


Upvotes: 0

Views: 381

Answers (2)

sjp

Reputation: 910

Starting from the code before your edit, I can get the four lines in two colours, dashed vs. solid, by mapping linetype to Partcpt. All I did was add linetype = Partcpt to the aes() call.

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
df %>%
  # cast all `median*`variables longer:
  pivot_longer(-c(Q, Partcpt)) %>%
  # rename:
  rename(slope_range = name) %>%
  # simplify labels:
  mutate(slope_range = str_replace(slope_range, ".*_(.*)$", "\\1")) %>%
  # convert slope_range to numerical variable:
  mutate(slope_range_N = case_when(
    slope_range == "all" ~ 1,
    slope_range == "half" ~ 2,
    slope_range == "third" ~ 3,
    TRUE                    ~ 4)
  ) %>%
  # plot:
  ggplot( 
    aes(x = slope_range_N, y = value, color = Q, linetype=Partcpt)) + 
  geom_smooth(method = "loess", se=FALSE) +
  scale_color_manual(values = c("black", "grey")) # use black and grey instead of the default colours

(Plot: four smoothed lines, coloured by Q, with line type by Partcpt.)

Upvotes: 1

camille

Reputation: 16842

This is partly just an improvement on the data wrangling; it looks like you overengineered your process. When you reshape the data, you can turn the slope variable into numbers via its factor levels, and you can do that within pivot_longer itself. Then map Partcpt to the linetype. One thing to note is that you've only got one observation per slope-participant-Q combination, so a LOESS model isn't really appropriate with so few observations (you'll get a wall of warnings about this). You could use a spline instead if you wanted.

library(dplyr)
library(tidyr)
library(ggplot2)

df_long <- df %>%
  pivot_longer(c(-Q, -Partcpt), names_to = "slope_range", 
               names_transform = list(slope_range = ~as.numeric(as.factor(.))))

ggplot(df_long, aes(x = slope_range, y = value, color = Q, linetype = Partcpt)) +
  geom_smooth(method = loess, se = FALSE) +
  guides(linetype = guide_legend(override.aes = list(color = "black")))
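
One caveat about the factor-level trick above: as.factor() sorts levels alphabetically, so "median_quart" comes before "median_third" and the two end up numbered 3 and 4 respectively, which is the reverse of the case_when() in the question. Below is a minimal sketch of passing the levels explicitly instead. It assumes the intended order is all, half, third, quart (taken from the question), uses a copy of the question's data rounded to three decimals, and slope_levels is just an illustrative name.

```r
library(tidyr)

# The question's data, rounded to three decimals to keep the example short
df <- data.frame(
  Q            = c("q_pol", "q_wh", "q_pol", "q_wh"),
  median_all   = c(0.667, 0.625, 0.548, 0.399),
  median_half  = c(-0.350, 1.425, 0.373, 0.441),
  median_third = c(-0.934, 0.236, -1.029, 0.036),
  median_quart = c(-0.112, 0.705, -0.849, 1.245),
  Partcpt      = c("Not_Answerer", "Not_Answerer", "Answerer", "Answerer")
)

# Assumed intended order of the slope ranges (hypothetical helper vector)
slope_levels <- c("median_all", "median_half", "median_third", "median_quart")

# Alphabetical levels put "median_quart" before "median_third"
as.numeric(as.factor(slope_levels))
#> [1] 1 2 4 3

# Explicit levels keep the order all < half < third < quart
df_long <- pivot_longer(
  df,
  cols = -c(Q, Partcpt),
  names_to = "slope_range",
  names_transform = list(
    slope_range = function(x) as.numeric(factor(x, levels = slope_levels))
  )
)
```

With this version, slope_range runs 1 to 4 in the same order as in the question, and the ggplot call above works unchanged.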

You can pass arguments to loess if you need to. Generally with something like this, however, I prefer doing the modeling myself across a larger span. I think this is similar to what geom_smooth does under the hood, but it can be useful to have direct access to the models. Here I'll fit a LOESS model for each Q-participant combination, use those models to predict values at a grid of points along the domain of slope_range, and then draw the predictions with geom_line directly. The default line width differs between geom_smooth and geom_line, but you can adjust that easily.

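df_modeled <- df_long %>%
  group_by(Q, Partcpt) %>%
  # one row per Q-Partcpt combination, with its observations in a list column
  nest() %>%
  mutate(loess_mod = purrr::map(data, ~loess(value ~ slope_range, data = .)),
         # grid of x values from the smallest to the largest slope_range
         x = purrr::map(data, ~seq(min(.$slope_range), max(.$slope_range), by = 0.1)),
         pred = purrr::map(loess_mod, ~predict(., newdata = unlist(x)))) %>%
  # unnest() takes a single selection of columns in tidyr >= 1.0
  unnest(c(x, pred))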

head(df_modeled)
#> # A tibble: 6 × 6
#> # Groups:   Q, Partcpt [1]
#>   Q     Partcpt      data             loess_mod     x   pred
#>   <chr> <chr>        <list>           <list>    <dbl>  <dbl>
#> 1 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1   0.667 
#> 2 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.1 0.598 
#> 3 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.2 0.496 
#> 4 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.3 0.373 
#> 5 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.4 0.236 
#> 6 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.5 0.0949

ggplot(df_modeled, aes(x = x, y = pred, color = Q, linetype = Partcpt)) +
  geom_line() +
  guides(linetype = guide_legend(override.aes = list(color = "black")))
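
As for the spline mentioned at the start of this answer, here is one possible sketch rather than a definitive recipe. It swaps the LOESS models for an interpolating cubic spline from stats::spline(), fitted per Q-Partcpt group, and draws the result with geom_line. The rounded data copy, the explicit slope order (all, half, third, quart) and the names slope_levels and df_spline are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# The question's data, rounded to three decimals to keep the example short
df <- data.frame(
  Q            = c("q_pol", "q_wh", "q_pol", "q_wh"),
  median_all   = c(0.667, 0.625, 0.548, 0.399),
  median_half  = c(-0.350, 1.425, 0.373, 0.441),
  median_third = c(-0.934, 0.236, -1.029, 0.036),
  median_quart = c(-0.112, 0.705, -0.849, 1.245),
  Partcpt      = c("Not_Answerer", "Not_Answerer", "Answerer", "Answerer")
)

# Assumed intended order of the slope ranges (hypothetical helper vector)
slope_levels <- c("median_all", "median_half", "median_third", "median_quart")

df_long <- pivot_longer(
  df,
  cols = -c(Q, Partcpt),
  names_to = "slope_range",
  names_transform = list(
    slope_range = function(x) as.numeric(factor(x, levels = slope_levels))
  )
)

# Interpolating cubic spline per group, evaluated at 31 points from 1 to 4.
# spline() returns list(x, y), which becomes a two-column data frame.
df_spline <- df_long %>%
  group_by(Q, Partcpt) %>%
  group_modify(~ as.data.frame(spline(.x$slope_range, .x$value, n = 31))) %>%
  ungroup()

p <- ggplot(df_spline, aes(x = x, y = y, color = Q, linetype = Partcpt)) +
  geom_line() +
  guides(linetype = guide_legend(override.aes = list(color = "black")))
p
```

Unlike LOESS, an interpolating spline passes exactly through every observed point, so it shows the data joined by smooth curves rather than a fitted trend. With one observation per group and slope range, that may be the more honest picture.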

Upvotes: 3
