slap-a-da-bias
slap-a-da-bias

Reputation: 406

R dplyr summarise date gaps

I have data on a set of students and the semesters they were enrolled in courses.

ID = c(1,1,1,
   2,2,
   3,3,3,3,3,
   4)

The semester variable "Date" is coded as the year followed by 20 for spring, 30 for summer, and 40 for fall. so the Date value 201430 is summer semester of 2014...

Date = c(201220,201240,201330,
     201340,201420,
     201120,201340,201420,201440,201540,
     201640)

Enrolled<-data.frame(ID,Date)

I'm using dplyr to group the data by ID and to summarise various aspects about a given student's enrollment history

Enrollment.History<-dplyr::select(Enrolled,ID,Date)%>%group_by(ID)%>%summarise(Total.Semesters = n_distinct(Date),
                                                                First.Semester = min(Date))

I'm trying to get a measure for the number of enrollment gaps that each student has, as well as the size of the largest enrollment gap. The data frame shouls end up looking like this:

Enrollment.History$Gaps<-c(2,0,3,0)
Enrollment.History$Biggest.Gap<-c(1,0,7,0)
print(Enrollment.History)

I'm just trying to figure out what the best way to code those gap variables. Is it better to turn that Date variable into an ordered factor? I hope this is a simple solution

Upvotes: 2

Views: 525

Answers (1)

talat
talat

Reputation: 70296

Since you are not dealing with real dates in a standard format, you can instead make use of factors to compute the gaps.

First you need to define a vector of all possible year/semester combinations ("Dates") in the correct order (this is important!).

all_semesters <- c(sapply(2011:2016, paste0, c(20,30,40)))

Then, you can create a new factor variable, arrange the data by ID and Date, and finally compute the maximum difference between two semesters:

Enrolled %>% 
  mutate(semester = factor(Enrolled$Date, levels = all_semesters)) %>% 
  group_by(ID) %>% 
  arrange(Date) %>% 
  summarise(max_gap = max(c(0, diff(as.integer(semester)) -1), na.rm = TRUE))

## A tibble: 4 × 2
#     ID max_gap
#  <dbl>   <dbl>
#1     1       1
#2     2       0
#3     3       7
#4     4       0

I used max(c(0, ...)) in the summarise, because otherwise you would end up with -Inf for IDs with a single entry.

Similarly, you could also achieve this by using match instead of a factor:

Enrolled %>% 
  mutate(semester = match(Date, all_semesters)) %>% 
  group_by(ID) %>% 
  arrange(Date) %>% 
  summarise(max_gap = max(c(0, diff(semester) -1), na.rm = TRUE))

Upvotes: 2

Related Questions