Reputation: 21410
I have this table which, once reshaped to long format, has three categorical variables (Q, Partcpt, and the median range) and one numerical variable:
df <- structure(list(Q = c("q_pol", "q_wh", "q_pol", "q_wh"),
median_all = c(0.667362125626559, 0.624735641188929, 0.548153075210995, 0.398574206026083),
median_half = c(-0.350314785114947,1.42461790732669, 0.372537880024059, 0.44085155122463),
median_third = c(-0.93389146143506,0.236025246988988, -1.02912771930043, 0.0361894830862238),
median_quart = c(-0.112157689065904, 0.704777764871505, -0.848709176683769, 1.24452019211073),
Partcpt = c("Not_Answerer", "Not_Answerer", "Answerer", "Answerer")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L))
I want to visualize how the values in the median* columns distribute over the three categorical variables using geom_smooth. To get there I've been doing this:
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
df %>%
  # cast all `median*` variables longer:
  pivot_longer(-c(Q, Partcpt)) %>%
  # rename:
  rename(slope_range = name) %>%
  # simplify labels:
  mutate(slope_range = str_replace(slope_range, ".*_(.*)$", "\\1")) %>%
  # convert slope_range to numerical variable:
  mutate(slope_range_N = case_when(
    slope_range == "all" ~ 1,
    slope_range == "half" ~ 2,
    slope_range == "third" ~ 3,
    TRUE ~ 4)
  ) %>%
  # plot:
  ggplot(
    aes(x = slope_range_N, y = value, color = Q)) +
  geom_smooth(method = "loess")
Two problems here: first, the conversion of slope_range to numeric seems unprofessional; second, and more importantly, the resulting plot does not show the distribution of value by Partcpt. How can that be included as the fourth variable in the plot?
EDIT:
Maybe the following goes some way toward a solution, the basic idea being that the Q values and the Partcpt values are cast into a single column (rather than two different ones):
# df1 is assumed to be the long-format data produced by the pipeline above
# (everything before the ggplot() call)
# df with `Q`:
df_Q <- df1 %>%
  select(Q, slope_range, value, slope_range_N) %>%
  rename(Cat = Q)
# df with `Partcpt`:
df_Partcpt <- df1 %>%
  select(Partcpt, slope_range, value, slope_range_N) %>%
  rename(Cat = Partcpt)
# bind:
plot_df <- bind_rows(df_Q, df_Partcpt)
# plot:
ggplot(plot_df,
       aes(x = slope_range_N, y = value, color = Cat)) +
  geom_smooth(method = "loess", span = 0.4, se = FALSE)
What I don't know is how to get just two colors for the two Q values and two line types for the two Partcpt values.
Upvotes: 0
Views: 381
Reputation: 910
Starting from the code before the edit, I can get the four lines in two colours, dashed vs. solid, by mapping linetype to Partcpt. All I did was add linetype = Partcpt to the aes() call.
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
df %>%
  # cast all `median*` variables longer:
  pivot_longer(-c(Q, Partcpt)) %>%
  # rename:
  rename(slope_range = name) %>%
  # simplify labels:
  mutate(slope_range = str_replace(slope_range, ".*_(.*)$", "\\1")) %>%
  # convert slope_range to numerical variable:
  mutate(slope_range_N = case_when(
    slope_range == "all" ~ 1,
    slope_range == "half" ~ 2,
    slope_range == "third" ~ 3,
    TRUE ~ 4)
  ) %>%
  # plot:
  ggplot(
    aes(x = slope_range_N, y = value, color = Q, linetype = Partcpt)) +
  geom_smooth(method = "loess", se = FALSE) +
  scale_color_manual(values = c("black", "grey")) # sets the two line colours manually
The plot output (image not included here) shows four lines, coloured black or grey by Q and drawn solid or dashed by Partcpt.
Upvotes: 1
Reputation: 16842
This is partly just an improvement on the data wrangling. It looks like you overengineered your process. When you reshape the data, you can turn the slope variable into numbers via its factor levels, directly within pivot_longer. Then map Partcpt to the linetype. Be aware that as.factor orders levels alphabetically (median_all, median_half, median_quart, median_third), so quart gets 3 and third gets 4, unlike the case_when mapping in the question; set the levels explicitly if the order matters. Another thing to note is that you've only got one observation per slope-participant-Q combination, so a LOESS model isn't really appropriate with so few observations (you'll get a wall of warnings about this). You could instead use a spline if you wanted.
library(dplyr)
library(tidyr)
library(ggplot2)
df_long <- df %>%
  pivot_longer(c(-Q, -Partcpt), names_to = "slope_range",
               names_transform = list(slope_range = ~as.numeric(as.factor(.))))
ggplot(df_long, aes(x = slope_range, y = value, color = Q, linetype = Partcpt)) +
  geom_smooth(method = loess, se = FALSE) +
  guides(linetype = guide_legend(override.aes = list(color = "black")))
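If the order of the slope ranges matters, the levels can be given explicitly during the reshape. The following is a minimal sketch, not part of the original answer: the names `range_levels` and `df_long_ordered` are made up for illustration, and the order all, half, third, quart is an assumption taken from the `case_when()` in the question.

```r
library(tidyr)

# Data from the question, rebuilt as a plain data.frame
df <- data.frame(
  Q = c("q_pol", "q_wh", "q_pol", "q_wh"),
  median_all = c(0.667362125626559, 0.624735641188929, 0.548153075210995, 0.398574206026083),
  median_half = c(-0.350314785114947, 1.42461790732669, 0.372537880024059, 0.44085155122463),
  median_third = c(-0.93389146143506, 0.236025246988988, -1.02912771930043, 0.0361894830862238),
  median_quart = c(-0.112157689065904, 0.704777764871505, -0.848709176683769, 1.24452019211073),
  Partcpt = c("Not_Answerer", "Not_Answerer", "Answerer", "Answerer")
)

# With default (alphabetical) levels, "quart" sorts before "third"
alphabetical <- as.numeric(as.factor(c("all", "half", "third", "quart")))

# Assumed intended order, taken from the question's case_when()
range_levels <- c("all", "half", "third", "quart")

df_long_ordered <- pivot_longer(
  df,
  cols = -c(Q, Partcpt),
  names_to = "slope_range",
  names_prefix = "median_",  # strip the shared prefix before transforming
  names_transform = list(
    slope_range = function(x) as.integer(factor(x, levels = range_levels))
  )
)
```

Here `slope_range` runs 1 to 4 in the order all, half, third, quart, so it can be mapped to the x axis exactly as before.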
You can pass arguments to loess if you need. Generally with something like this, however, I usually prefer doing the modeling myself across a larger span. I think this is similar to what geom_smooth does under the hood, but it can be useful to have direct access to it. Here I'll make LOESS models for each Q-participant combo, then use those to predict values for a bunch of points along the domain of slope_range. Then use geom_line directly. The default line width is different between geom_smooth and geom_line, but you can adjust that easily.
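df_modeled <- df_long %>%
  group_by(Q, Partcpt) %>%
  nest() %>%
  # fit one LOESS model per Q-Partcpt group, then predict on a grid of x values
  mutate(loess_mod = purrr::map(data, ~loess(value ~ slope_range, data = .)),
         x = purrr::map(data, ~seq(min(.$slope_range), max(.$slope_range), by = 0.1)),
         pred = purrr::map2(loess_mod, x, ~predict(.x, newdata = data.frame(slope_range = .y)))) %>%
  # unnest both list-columns together (the current tidyr interface)
  unnest(c(pred, x))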
head(df_modeled)
#> # A tibble: 6 × 6
#> # Groups:   Q, Partcpt [1]
#>   Q     Partcpt      data             loess_mod     x   pred
#>   <chr> <chr>        <list>           <list>    <dbl>  <dbl>
#> 1 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1   0.667
#> 2 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.1 0.598
#> 3 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.2 0.496
#> 4 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.3 0.373
#> 5 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.4 0.236
#> 6 q_pol Not_Answerer <tibble [4 × 2]> <loess>     1.5 0.0949
ggplot(df_modeled, aes(x = x, y = pred, color = Q, linetype = Partcpt)) +
  geom_line() +
  guides(linetype = guide_legend(override.aes = list(color = "black")))
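For the spline alternative mentioned earlier, one possible sketch uses `stats::spline()` to interpolate the four points of each group and then draws the result with `geom_line`. This is an illustration under stated assumptions, not the answer's own method: `df_spline` and `p` are made-up names, the range order all, half, third, quart is assumed from the question, and `linewidth = 1` (chosen to roughly match the `geom_smooth` default) needs ggplot2 3.4.0 or later; older versions use `size` instead.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Data from the question, rebuilt as a plain data.frame
df <- data.frame(
  Q = c("q_pol", "q_wh", "q_pol", "q_wh"),
  median_all = c(0.667362125626559, 0.624735641188929, 0.548153075210995, 0.398574206026083),
  median_half = c(-0.350314785114947, 1.42461790732669, 0.372537880024059, 0.44085155122463),
  median_third = c(-0.93389146143506, 0.236025246988988, -1.02912771930043, 0.0361894830862238),
  median_quart = c(-0.112157689065904, 0.704777764871505, -0.848709176683769, 1.24452019211073),
  Partcpt = c("Not_Answerer", "Not_Answerer", "Answerer", "Answerer")
)

# Long format with slope_range as 1..4 (assumed order: all, half, third, quart)
df_long <- pivot_longer(df, cols = -c(Q, Partcpt),
                        names_to = "slope_range", names_prefix = "median_")
df_long$slope_range <- as.integer(
  factor(df_long$slope_range, levels = c("all", "half", "third", "quart"))
)

# Interpolating cubic spline through the four points of each Q-Partcpt group,
# evaluated at 31 equally spaced x values between 1 and 4
df_spline <- df_long %>%
  group_by(Q, Partcpt) %>%
  group_modify(~ as.data.frame(spline(.x$slope_range, .x$value, n = 31))) %>%
  ungroup()

# Same colour and linetype mapping as above, with a thicker line
p <- ggplot(df_spline, aes(x = x, y = y, color = Q, linetype = Partcpt)) +
  geom_line(linewidth = 1) +
  guides(linetype = guide_legend(override.aes = list(color = "black")))
```

Printing `p` draws the plot. Unlike LOESS, the interpolating spline passes exactly through the observed medians, which may or may not be what you want with only four points per line.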
Upvotes: 3