Reputation: 175
I have a data frame that records employees' attendance history and it looks like the following:
ID Sunday Monday Tuesday Wednesday Thursday Friday Saturday
1585 NA NA NA NA NA NA NA
1585 NA S S S S H NA
1585 NA H S S NA NA NA
1585 NA S S S NA NA NA
1597 NA S S NA S NA NA
1597 NA NA NA NA NA H NA
1597 NA H S S NA NA NA
1597 NA NA NA NA NA NA NA
In the above sample, there are two individuals uniquely identified by ID, the following 7 columns are Saturday to Sunday that begins at say, April 1st, 2017. There are three attendance behaviours: S
means sick leave, H
stands for holidays and NA
means this individual is working on that day.
The interest is to re-organize the sick leave absence records. For example, individual 1585 begins a sick leave in Monday, April 10th,2017, and ends in Wednesday, April 19th,2017, lasting 10 days. Notice that during the 10 days, there are two days of local holidays, but it would be considered belonging to this sick leave spell. Then, this person begins a second sick leave in Monday, April 24th, 2017 and ends in Wednesday, April 26th.
We also have a record about the second person with ID 1597, again begins at April 1st, 2017 (so for each person, the beginning and ending dates of the records are the same). This person has three absence spells: the first one begins in Monday, April 3rd,2017 and ends in next day, April, 4th. The second spell lasts only one day and it begins and ends at April 6th. The last spell begins in April 18th and ends in April 19th.
The desired output would be like this:
ID Begin_date End_date Duration
1585 2017-04-10 2017-04-19 10
1585 2017-04-24 2017-04-26 3
1597 2017-04-03 2017-04-04 2
1597 2017-04-06 2017-04-06 1
1597 2017-04-18 2017-04-19 2
The difficulty I face is how to recognize the consecutive sick leave dates, and on top of that, during a single sick leave spell, it is possible to have different types of attendance type (holidays), but holidays are still considered to be belonged to that single sick leave spell.
Upvotes: 0
Views: 187
Reputation: 175
Based on the idea of @Cholts, I write a R code for generating the desired output
#clean the workspace
rm(list=ls(all=TRUE))
require(tidyr)
library(dplyr)
library(lubridate)
library(stringr)
ID = c(rep(1585,4),rep(1597,4))
Sun = c(rep("D",8))
Sat = c(rep("D",8))
Mon = c("Y","S","H","S","S","Y","H","Y")
Tue = c("Y","S","S","S","S","Y","S","Y")
Wed = c("Y","S","S","S","Y","Y","S","Y")
Thur = c("Y","S","Y","Y","S","Y","Y","Y")
Fri = c("Y","H","Y","Y","Y","H","Y","Y")
id_u = unique(ID)
df = data.frame(Sun,Mon,Tue,Wed,Thur,Fri,Sat)
new_df = df %>% unite(new,Sun,Mon,Tue,Wed,Thur,Fri,Sat,remove=FALSE,sep="")
vstr = new_df$new
#===========================================================
idd = c()
begin_date = c()
end_date = c()
duration = c()
n = 2
start_date = ymd('2017-04-02')
for(i in 1:n){
ps = (i-1)*4 +1
pe = (i-1)*4 + 4
indstr = paste(vstr[ps:pe],collapse = "")
loca = str_locate_all(indstr,"S[SHD]*S|S")
rn = length(loca[[1]][,1])
for (j in 1:rn){
idd = append(idd,id_u[i])
begin_date = append(begin_date,ymd(start_date+loca[[1]][j,1]-1))
end_date = append(end_date,ymd(start_date+loca[[1]][j,2]-1))
duration = append(duration,loca[[1]][j,2]-loca[[1]][j,1]+1)
}
}
final_df = data.frame(idd,begin_date,end_date,duration)
The output is
> final_df
idd begin_date end_date duration
1 1585 2017-04-10 2017-04-19 10
2 1585 2017-04-24 2017-04-26 3
3 1597 2017-04-03 2017-04-04 2
4 1597 2017-04-06 2017-04-06 1
5 1597 2017-04-18 2017-04-19 2
Upvotes: 1
Reputation: 101
You mention that one of your biggest problems is recognizing sick leave spells given that they can include holidays and weekends. It occurs to me that a sick leave spell could be represented as a regex. Here is a potential solution that takes advantage of this:
First, replace all the NAs on weekends with 'D' and all NAs on weekdays with 'Y' (or some other 2 characters). In your example, normal weekends are treated differently from normal weekdays in that absence spells can contain normal weekends but not normal weekdays, so they should have a different value.
Then, for each ID, flatten the data, and concatenate it into a string. Absence spells can be identified by regex as "(S[SHD]*S)|S"
. For each found regex string, create a row in your new table where the begin date, end date, and duration of the absence spell are based on the start position, end position, and length of the found string.
For this solution to work, we need to assume that each ID has the same number of rows corresponding to the same dates, so that we can map start positions of the strings to dates properly.
Upvotes: 1