Use python pandas or R, organize calendar data

Question

I have a data frame that records employees' attendance history and it looks like the following:

ID    Sunday Monday Tuesday Wednesday Thursday Friday Saturday
1585    NA      NA      NA      NA        NA      NA     NA
1585    NA       S       S       S         S       H     NA
1585    NA       H       S       S        NA      NA     NA
1585    NA       S       S       S        NA      NA     NA
1597    NA       S       S      NA         S      NA     NA
1597    NA      NA      NA      NA        NA       H     NA
1597    NA      H        S       S        NA      NA     NA
1597    NA      NA      NA      NA        NA      NA     NA

In the above sample, there are two individuals uniquely identified by ID, the following 7 columns are Saturday to Sunday that begins at say, April 1st, 2017. There are three attendance behaviours: S means sick leave, H stands for holidays and NA means this individual is working on that day.

The interest is to re-organize the sick leave absence records. For example, individual 1585 begins a sick leave in Monday, April 10th,2017, and ends in Wednesday, April 19th,2017, lasting 10 days. Notice that during the 10 days, there are two days of local holidays, but it would be considered belonging to this sick leave spell. Then, this person begins a second sick leave in Monday, April 24th, 2017 and ends in Wednesday, April 26th.

We also have a record about the second person with ID 1597, again begins at April 1st, 2017 (so for each person, the beginning and ending dates of the records are the same). This person has three absence spells: the first one begins in Monday, April 3rd,2017 and ends in next day, April, 4th. The second spell lasts only one day and it begins and ends at April 6th. The last spell begins in April 18th and ends in April 19th.

The desired output would be like this:

ID     Begin_date    End_date    Duration
1585   2017-04-10    2017-04-19   10
1585   2017-04-24    2017-04-26   3
1597   2017-04-03    2017-04-04   2
1597   2017-04-06    2017-04-06   1
1597   2017-04-18    2017-04-19   2

The difficulty I face is how to recognize the consecutive sick leave dates, and on top of that, during a single sick leave spell, it is possible to have different types of attendance type (holidays), but holidays are still considered to be belonged to that single sick leave spell.

skyindeer · Accepted Answer

Based on the idea of @Cholts, I write a R code for generating the desired output

#clean the workspace
rm(list=ls(all=TRUE))
require(tidyr)
library(dplyr)
library(lubridate)
library(stringr)

ID = c(rep(1585,4),rep(1597,4))
Sun = c(rep("D",8))
Sat = c(rep("D",8))
Mon = c("Y","S","H","S","S","Y","H","Y")
Tue = c("Y","S","S","S","S","Y","S","Y")
Wed = c("Y","S","S","S","Y","Y","S","Y")
Thur = c("Y","S","Y","Y","S","Y","Y","Y")
Fri = c("Y","H","Y","Y","Y","H","Y","Y")
id_u = unique(ID)
df = data.frame(Sun,Mon,Tue,Wed,Thur,Fri,Sat)

new_df = df %>% unite(new,Sun,Mon,Tue,Wed,Thur,Fri,Sat,remove=FALSE,sep="")
vstr = new_df$new


#===========================================================
idd = c()
begin_date = c()
end_date = c()
duration = c()

n = 2
start_date = ymd('2017-04-02')

for(i in 1:n){
   ps = (i-1)*4 +1
   pe = (i-1)*4 + 4
   indstr = paste(vstr[ps:pe],collapse = "")
   loca = str_locate_all(indstr,"S[SHD]*S|S")

   rn = length(loca[[1]][,1])
   for (j in 1:rn){
     idd = append(idd,id_u[i])
     begin_date = append(begin_date,ymd(start_date+loca[[1]][j,1]-1))
     end_date = append(end_date,ymd(start_date+loca[[1]][j,2]-1))
     duration = append(duration,loca[[1]][j,2]-loca[[1]][j,1]+1)
   }

}

final_df = data.frame(idd,begin_date,end_date,duration)

The output is

> final_df    
   idd begin_date   end_date duration
1 1585 2017-04-10 2017-04-19       10
2 1585 2017-04-24 2017-04-26        3
3 1597 2017-04-03 2017-04-04        2
4 1597 2017-04-06 2017-04-06        1
5 1597 2017-04-18 2017-04-19        2

Use python pandas or R, organize calendar data

Answers (2)

Related Questions