a_todd12

Reputation: 602

Calculating var by year to plot geom_line()

I have a dataset with a bunch of observations by year. I just want to calculate the percentages of "fail" and "attend" by year, and then plot the two yearly trends together on the same plot with geom_line(). I got started with the code below, but it's not quite right. I think it needs to be collapsed by year?

Code:

df %>% 
  group_by(year) %>% 
  mutate(perc_fail = fail/sum(fail),
         perc_attend = attend/sum(attend)) %>% 
  ggplot(., aes(x = year)) +
  geom_line()

Data:

df <- structure(list(year = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L), .Label = c("2000", "2001", "2002", "2003"
), class = "factor"), fail = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 
1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 
0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 
0, 0, 1, 1, 0, 0, 0, 0), attend = c(1, 1, 1, 1, 1, 0, 0, 1, 1, 
1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 
1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 
1, 1, 1, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -60L), spec = structure(list(
    cols = list(year = structure(list(), class = c("collector_double", 
  

Upvotes: 0

Views: 40

Answers (1)

DaveArmstrong

Reputation: 21992

You can use summarise() rather than mutate() to get a single value per year and then plot. Because fail and attend are coded 0/1, mean() gives the proportion for each year. Note that when you're plotting several series from different variables, you can put the label you want in the legend inside the aesthetic (as I did for colour in both geom_line() calls).

library(dplyr)
library(tidyr)
library(ggplot2)

df %>% 
  group_by(year) %>% 
  summarise(perc_fail = mean(fail),
            perc_attend = mean(attend)) %>% 
  ggplot(., aes(x = year, group=1)) +
  geom_line(aes(y= perc_fail, colour="Fail")) + 
  geom_line(aes(y=perc_attend, colour="Attend")) + 
  labs(y="Percent", 
       x="Year", 
       colour ="") + 
  scale_y_continuous(labels=~scales::percent(.x))
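
To see why summarise() is the right verb here, it may help to compare the two on a tiny example. This is only a sketch: toy is a made-up data frame (not the df from the question), and kept and collapsed are hypothetical names used for illustration. mutate() keeps every input row, while summarise() returns one row per group, which is the "collapse by year" the question asks about. Note also that fail/sum(fail) gives each row's share of that year's failures, not the failure rate, whereas mean(fail) on a 0/1 variable gives the rate.

```r
library(dplyr)

# Hypothetical toy data, assumed for illustration only
toy <- data.frame(
  year = factor(c("2000", "2000", "2001", "2001")),
  fail = c(0, 1, 1, 1)
)

# mutate() adds a column but keeps all four rows
kept <- toy %>%
  group_by(year) %>%
  mutate(perc_fail = mean(fail)) %>%
  ungroup()

# summarise() collapses to one row per year
collapsed <- toy %>%
  group_by(year) %>%
  summarise(perc_fail = mean(fail))

nrow(kept)           # 4
nrow(collapsed)      # 2
collapsed$perc_fail  # 0.5 1.0
```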


You could also pivot the data to long format and use stat_summary() to generate the summary statistics for you. The code below will produce the same graph.

df %>% 
  mutate(year = as.numeric(as.character(year))) %>% 
  pivot_longer(c("fail", "attend"), names_to="status", values_to = "vals") %>% 
  ggplot(aes(x=year, y = vals, colour=status)) + 
  stat_summary(fun = mean, geom="line") +  
  labs(y="Percent", 
       x="Year", 
       colour ="") + 
  scale_y_continuous(labels=~scales::percent(.x))
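
If it is unclear what the long format looks like, here is a small sketch of the reshaping step on its own, again with a made-up toy data frame rather than the real df (toy, long, and long_means are hypothetical names). Each original row becomes one row per status, and the grouped mean computed explicitly below is the same quantity that stat_summary(fun = mean) computes for each year and status when drawing the lines.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, assumed for illustration only
toy <- data.frame(
  year   = c(2000, 2000, 2001, 2001),
  fail   = c(0, 1, 1, 1),
  attend = c(1, 1, 0, 1)
)

# One row per original row per status: 4 rows become 8
long <- toy %>%
  pivot_longer(c("fail", "attend"), names_to = "status", values_to = "vals")

# Mean of the 0/1 values within each year and status
long_means <- long %>%
  group_by(year, status) %>%
  summarise(vals = mean(vals), .groups = "drop")

nrow(long)  # 8
long_means
```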

Upvotes: 1
