skyindeer
skyindeer

Reputation: 175

Use python pandas or R, organize calendar data

I have a data frame that records employees' attendance history and it looks like the following:

ID    Sunday Monday Tuesday Wednesday Thursday Friday Saturday
1585    NA      NA      NA      NA        NA      NA     NA
1585    NA       S       S       S         S       H     NA
1585    NA       H       S       S        NA      NA     NA
1585    NA       S       S       S        NA      NA     NA
1597    NA       S       S      NA         S      NA     NA
1597    NA      NA      NA      NA        NA       H     NA
1597    NA      H        S       S        NA      NA     NA
1597    NA      NA      NA      NA        NA      NA     NA

In the above sample, there are two individuals uniquely identified by ID, the following 7 columns are Saturday to Sunday that begins at say, April 1st, 2017. There are three attendance behaviours: S means sick leave, H stands for holidays and NA means this individual is working on that day.

The interest is to re-organize the sick leave absence records. For example, individual 1585 begins a sick leave in Monday, April 10th,2017, and ends in Wednesday, April 19th,2017, lasting 10 days. Notice that during the 10 days, there are two days of local holidays, but it would be considered belonging to this sick leave spell. Then, this person begins a second sick leave in Monday, April 24th, 2017 and ends in Wednesday, April 26th.

We also have a record about the second person with ID 1597, again begins at April 1st, 2017 (so for each person, the beginning and ending dates of the records are the same). This person has three absence spells: the first one begins in Monday, April 3rd,2017 and ends in next day, April, 4th. The second spell lasts only one day and it begins and ends at April 6th. The last spell begins in April 18th and ends in April 19th.

The desired output would be like this:

ID     Begin_date    End_date    Duration
1585   2017-04-10    2017-04-19   10
1585   2017-04-24    2017-04-26   3
1597   2017-04-03    2017-04-04   2
1597   2017-04-06    2017-04-06   1
1597   2017-04-18    2017-04-19   2

The difficulty I face is how to recognize the consecutive sick leave dates, and on top of that, during a single sick leave spell, it is possible to have different types of attendance type (holidays), but holidays are still considered to be belonged to that single sick leave spell.

Upvotes: 0

Views: 187

Answers (2)

skyindeer
skyindeer

Reputation: 175

Based on the idea of @Cholts, I write a R code for generating the desired output

#clean the workspace
rm(list=ls(all=TRUE))
require(tidyr)
library(dplyr)
library(lubridate)
library(stringr)

ID = c(rep(1585,4),rep(1597,4))
Sun = c(rep("D",8))
Sat = c(rep("D",8))
Mon = c("Y","S","H","S","S","Y","H","Y")
Tue = c("Y","S","S","S","S","Y","S","Y")
Wed = c("Y","S","S","S","Y","Y","S","Y")
Thur = c("Y","S","Y","Y","S","Y","Y","Y")
Fri = c("Y","H","Y","Y","Y","H","Y","Y")
id_u = unique(ID)
df = data.frame(Sun,Mon,Tue,Wed,Thur,Fri,Sat)

new_df = df %>% unite(new,Sun,Mon,Tue,Wed,Thur,Fri,Sat,remove=FALSE,sep="")
vstr = new_df$new


#===========================================================
idd = c()
begin_date = c()
end_date = c()
duration = c()

n = 2
start_date = ymd('2017-04-02')

for(i in 1:n){
   ps = (i-1)*4 +1
   pe = (i-1)*4 + 4
   indstr = paste(vstr[ps:pe],collapse = "")
   loca = str_locate_all(indstr,"S[SHD]*S|S")

   rn = length(loca[[1]][,1])
   for (j in 1:rn){
     idd = append(idd,id_u[i])
     begin_date = append(begin_date,ymd(start_date+loca[[1]][j,1]-1))
     end_date = append(end_date,ymd(start_date+loca[[1]][j,2]-1))
     duration = append(duration,loca[[1]][j,2]-loca[[1]][j,1]+1)
   }

}

final_df = data.frame(idd,begin_date,end_date,duration)

The output is

> final_df    
   idd begin_date   end_date duration
1 1585 2017-04-10 2017-04-19       10
2 1585 2017-04-24 2017-04-26        3
3 1597 2017-04-03 2017-04-04        2
4 1597 2017-04-06 2017-04-06        1
5 1597 2017-04-18 2017-04-19        2

Upvotes: 1

Cholts
Cholts

Reputation: 101

You mention that one of your biggest problems is recognizing sick leave spells given that they can include holidays and weekends. It occurs to me that a sick leave spell could be represented as a regex. Here is a potential solution that takes advantage of this:

First, replace all the NAs on weekends with 'D' and all NAs on weekdays with 'Y' (or some other 2 characters). In your example, normal weekends are treated differently from normal weekdays in that absence spells can contain normal weekends but not normal weekdays, so they should have a different value.

Then, for each ID, flatten the data, and concatenate it into a string. Absence spells can be identified by regex as "(S[SHD]*S)|S". For each found regex string, create a row in your new table where the begin date, end date, and duration of the absence spell are based on the start position, end position, and length of the found string.

For this solution to work, we need to assume that each ID has the same number of rows corresponding to the same dates, so that we can map start positions of the strings to dates properly.

Upvotes: 1

Related Questions