Jamalan
Jamalan

Reputation: 580

Efficient dplyr-style operations

I have these two toy example tables:
Table 1:

attendance_events <- data.frame(student_id = c("RA123","RB123","RC123","RA456","RB456","RC456","RA123","RB123","RC123","RA456","RB456","RC456"),
                                dates = c("2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-02","2020-02-02","2020-02-02","2020-02-02","2020-02-02","2020-02-02"),
                                attendance = c(1,1,1,0,1,1,0,0,1,0,0,1),
                                stringsAsFactors = F)
attendance_events
   student_id      dates attendance
1       RA123 2020-02-01          1
2       RB123 2020-02-01          1
3       RC123 2020-02-01          1
4       RA456 2020-02-01          0
5       RB456 2020-02-01          1
6       RC456 2020-02-01          1
7       RA123 2020-02-02          0
8       RB123 2020-02-02          0
9       RC123 2020-02-02          1
10      RA456 2020-02-02          0
11      RB456 2020-02-02          0
12      RC456 2020-02-02          1

Table2:

all_students <- data.frame(student_id = c("RA123","RB123","RC123","RA456","RB456",'RC456'),
                           school_id = c(1,1,1,1,1,2),
                           grade_level = c(10,10,9,9,11,11),
                           date_of_birth = c("1990-02-02","1990-02-02","1991-01-01","1991-02-01","1989-02-02","1989-02-02"),
                           hometown = c("farm","farm","farm","farm","farm","city"),
                           stringsAsFactors = F)
> all_students
  student_id school_id grade_level date_of_birth hometown
1      RA123         1          10    1990-02-02     farm
2      RB123         1          10    1990-02-02     farm
3      RC123         1           9    1991-01-01     farm
4      RA456         1           9    1991-02-01     farm
5      RB456         1          11    1989-02-02     farm
6      RC456         2          11    1989-02-02     city

attendance in attendance_events is 0 if the student was absent that day.

My question is what is the most efficient way in R to find the grade_level that had the largest drop off in attendance between "2020-02-01" and "2020-02-02"

My code is:

#Only include absences because it will be a smaller dataset
att_ws_alt <- inner_join(attendance_events, all_students[,c("student_id","grade_level")], by = "student_id") %>%
              filter(attendance == 0)

#Set days to check between
date_from <- "2020-02-01"
date_to <- "2020-02-02"

#Continously pipe to not have to store and reference(?)
att_drop_alt <- att_ws_alt %>%
                filter(dates %in% c(date_from, date_to)) %>%
                group_by(grade_level,dates) %>%
                summarize(absence_bydate = n()) %>%
                dcast(grade_level ~ dates) %>% 
                sapply(FUN = function(x) { x[is.na(x)] <- 0; x}) %>%
                as.data.frame() %>%
                mutate("absence_change" = .[,3] - .[,2]) %>%
                select(grade_level, absence_change) %>%
                arrange(desc(absence_change))
>att_drop_alt
  grade_level absence_change
1          10              2
2          11              1
3           9              0

However, this feels a bit complex for what seems like a reasonably simple question. I want to see other ways R programmers could answer this question, ideally for better performance but even readability would be good to see.

Thanks community!

Upvotes: 0

Views: 75

Answers (3)

akrun
akrun

Reputation: 887531

With data.table

library(data.table)
setDT(attendance_events)[all_students, .SD[, .(sum(attendance)), 
  .(grade_level, dates)], on = .(student_id)][, 
       .(attendanace_change = diff(rev(V1))), .(grade_level)]
#   grade_level attendanace_change
#1:          10                  2
#2:           9                  0
#3:          11                  1

Upvotes: 3

dario
dario

Reputation: 6483

Sorry if this doesn't exactly answer your question, but I wouldn't want to unfairly accuse the students of being more absent then they were ;)

library(dplyr)
all_students %>% 
  left_join(attendance_events) %>% 
  mutate(dates = as.Date(dates)) %>% 
  group_by(grade_level, dates) %>% 
    summarise(NAbs = sum(ifelse(attendance == 0, 1, 0)),
              N = n(),
              pctAbs = NAbs / n() * 100) %>% 
  arrange(dates) %>%
  mutate(change =  pctAbs - lag(pctAbs)) %>% 
  ungroup() %>% 
  arrange(change)



  # A tibble: 6 x 6
    dates      grade_level  NAbs     N pctAbs change
   <date>           <dbl> <dbl> <int>  <dbl>  <dbl>
  1 2020-02-02           9     1     2     50      0
  2 2020-02-02          11     1     2     50     50
  3 2020-02-02          10     2     2    100    100
  4 2020-02-01           9     1     2     50     NA
  5 2020-02-01          10     0     2      0     NA
  6 2020-02-01          11     0     2      0     NA

Upvotes: 2

Allan Cameron
Allan Cameron

Reputation: 174293

I guess this is a little more concise:

left_join(attendance_events, all_students, by = "student_id") %>% 
  group_by(grade_level, dates) %>% 
  summarise(attendance = sum(attendance)) %>% 
  group_by(grade_level) %>% 
  summarize(attendance_change = diff(attendance))
#> # A tibble: 3 x 2
#>   grade_level attendance_change
#>         <dbl>             <dbl>
#> 1           9                 0
#> 2          10                -2
#> 3          11                -1

Of course, if you want to count absences instead of attendances, just put a minus sign in front of the diff on the last line.

Upvotes: 2

Related Questions