Group by series or sequence in R

Question

I have following dataframe.

As suggested by many user I am sharing this with dput.

 > dput(time_interval_df)
structure(list(timeinterval = structure(1:24, .Label = c("00:00:00 00:59:59", 
"01:00:00 01:59:59", "02:00:00 02:59:59", "03:00:00 03:59:59", 
"04:00:00 04:59:59", "05:00:00 05:59:59", "06:00:00 06:59:59", 
"07:00:00 07:59:59", "08:00:00 08:59:59", "09:00:00 09:59:59", 
"10:00:00 10:59:59", "11:00:00 11:59:59", "12:00:00 12:59:59", 
"13:00:00 13:59:59", "14:00:00 14:59:59", "15:00:00 15:59:59", 
"16:00:00 16:59:59", "17:00:00 17:59:59", "18:00:00 18:59:59", 
"19:00:00 19:59:59", "20:00:00 20:59:59", "21:00:00 21:59:59", 
"22:00:00 22:59:59", "23:00:00 23:59:59"), class = "factor"), 
    timecount = c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 
    0, 0, 0, 0, 1, 0, 1, 1, 1)), .Names = c("timeinterval", "timecount"
), row.names = c(NA, -24L), class = "data.frame")

Structure of my dataframe is

str(time_interval_df)
'data.frame':   24 obs. of  2 variables:
 $ timeinterval: Factor w/ 24 levels "00:00:00 00:59:59",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ timecount   : num  1 1 1 0 0 0 0 0 1 1 ...

If the difference between the timeinterval is 1 and there is some time count then I want to group them in a single group .For example the difference b/w row no 1 and row no 2 is 1 and also the difference b/w row2 and row3 (consecutively) is 1 so I want to group all into one and display the sum of timecount in front of new row i.e. "00:00:00 02:59:59" Below is my desired output

After getting the second output my second desired result will be

Please help me on this. I have spent a lot of time in this.

akrun · Accepted Answer

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1). Create two columns ('time1', 'time2') by splitting the 'timeinterval (tstrsplit()) at the whitespace, grouped by the run-length-id of 'timecount', if all the 'timecount' values are not equal to 0, then paste the first (1) and last (.N) elements of 'time1' and 'time2' respectively, and also get the number or rows (.N). If needed, assign the 'grp' column to 'NULL'

library(data.table)
res1 <- setDT(df1)[, c('time1', 'time2') := tstrsplit(timeinterval, " ")
     ][, if(all(timecount!=0)) .(timeinterval= paste(time1[1], time2[.N]), timecount = .N),
          .(grp = rleid(timecount))][, grp := NULL][]
res1
#        timeinterval timecount
#1: 00:00:00 02:59:59         3
#2: 08:00:00 09:59:59         2
#3: 14:00:00 14:59:59         1
#4: 19:00:00 19:59:59         1
#5: 21:00:00 23:59:59         3

From 'res1', we can summarise to get the 'res2'

rbind(res1[c(1, .N)][, .(timeinterval = paste(substr(timeinterval[.N], 
    1, 8), substring(timeinterval[1], 10)), timecount= sum(timecount))], res1[-c(1, .N)])
#        timeinterval timecount
#1: 21:00:00 02:59:59         6
#2: 08:00:00 09:59:59         2
#3: 14:00:00 14:59:59         1
#4: 19:00:00 19:59:59         1

Group by series or sequence in R

Answers (1)

Related Questions