Reputation: 406
I have data on a set of students and the semesters they were enrolled in courses.
ID = c(1,1,1,
2,2,
3,3,3,3,3,
4)
The semester variable "Date" is coded as the year followed by 20 for spring, 30 for summer, and 40 for fall. so the Date value 201430 is summer semester of 2014...
Date = c(201220,201240,201330,
201340,201420,
201120,201340,201420,201440,201540,
201640)
Enrolled<-data.frame(ID,Date)
I'm using dplyr to group the data by ID and to summarise various aspects about a given student's enrollment history
Enrollment.History<-dplyr::select(Enrolled,ID,Date)%>%group_by(ID)%>%summarise(Total.Semesters = n_distinct(Date),
First.Semester = min(Date))
I'm trying to get a measure for the number of enrollment gaps that each student has, as well as the size of the largest enrollment gap. The data frame shouls end up looking like this:
Enrollment.History$Gaps<-c(2,0,3,0)
Enrollment.History$Biggest.Gap<-c(1,0,7,0)
print(Enrollment.History)
I'm just trying to figure out what the best way to code those gap variables. Is it better to turn that Date variable into an ordered factor? I hope this is a simple solution
Upvotes: 2
Views: 525
Reputation: 70296
Since you are not dealing with real dates in a standard format, you can instead make use of factors to compute the gaps.
First you need to define a vector of all possible year/semester combinations ("Dates") in the correct order (this is important!).
all_semesters <- c(sapply(2011:2016, paste0, c(20,30,40)))
Then, you can create a new factor variable, arrange the data by ID and Date, and finally compute the maximum difference between two semesters:
Enrolled %>%
mutate(semester = factor(Enrolled$Date, levels = all_semesters)) %>%
group_by(ID) %>%
arrange(Date) %>%
summarise(max_gap = max(c(0, diff(as.integer(semester)) -1), na.rm = TRUE))
## A tibble: 4 × 2
# ID max_gap
# <dbl> <dbl>
#1 1 1
#2 2 0
#3 3 7
#4 4 0
I used max(c(0, ...))
in the summarise, because otherwise you would end up with -Inf
for IDs with a single entry.
Similarly, you could also achieve this by using match
instead of a factor:
Enrolled %>%
mutate(semester = match(Date, all_semesters)) %>%
group_by(ID) %>%
arrange(Date) %>%
summarise(max_gap = max(c(0, diff(semester) -1), na.rm = TRUE))
Upvotes: 2