learner
learner

Reputation: 959

Group by series or sequence in R

I have following dataframe.

enter image description here

As suggested by many user I am sharing this with dput.

 > dput(time_interval_df)
structure(list(timeinterval = structure(1:24, .Label = c("00:00:00 00:59:59", 
"01:00:00 01:59:59", "02:00:00 02:59:59", "03:00:00 03:59:59", 
"04:00:00 04:59:59", "05:00:00 05:59:59", "06:00:00 06:59:59", 
"07:00:00 07:59:59", "08:00:00 08:59:59", "09:00:00 09:59:59", 
"10:00:00 10:59:59", "11:00:00 11:59:59", "12:00:00 12:59:59", 
"13:00:00 13:59:59", "14:00:00 14:59:59", "15:00:00 15:59:59", 
"16:00:00 16:59:59", "17:00:00 17:59:59", "18:00:00 18:59:59", 
"19:00:00 19:59:59", "20:00:00 20:59:59", "21:00:00 21:59:59", 
"22:00:00 22:59:59", "23:00:00 23:59:59"), class = "factor"), 
    timecount = c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 
    0, 0, 0, 0, 1, 0, 1, 1, 1)), .Names = c("timeinterval", "timecount"
), row.names = c(NA, -24L), class = "data.frame")

Structure of my dataframe is

str(time_interval_df)
'data.frame':   24 obs. of  2 variables:
 $ timeinterval: Factor w/ 24 levels "00:00:00 00:59:59",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ timecount   : num  1 1 1 0 0 0 0 0 1 1 ...

If the difference between the timeinterval is 1 and there is some time count then I want to group them in a single group .For example the difference b/w row no 1 and row no 2 is 1 and also the difference b/w row2 and row3 (consecutively) is 1 so I want to group all into one and display the sum of timecount in front of new row i.e. "00:00:00 02:59:59" Below is my desired output

enter image description here

After getting the second output my second desired result will be

enter image description here

Please help me on this. I have spent a lot of time in this.

Upvotes: 0

Views: 378

Answers (1)

akrun
akrun

Reputation: 887118

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1). Create two columns ('time1', 'time2') by splitting the 'timeinterval (tstrsplit()) at the whitespace, grouped by the run-length-id of 'timecount', if all the 'timecount' values are not equal to 0, then paste the first (1) and last (.N) elements of 'time1' and 'time2' respectively, and also get the number or rows (.N). If needed, assign the 'grp' column to 'NULL'

library(data.table)
res1 <- setDT(df1)[, c('time1', 'time2') := tstrsplit(timeinterval, " ")
     ][, if(all(timecount!=0)) .(timeinterval= paste(time1[1], time2[.N]), timecount = .N),
          .(grp = rleid(timecount))][, grp := NULL][]
res1
#        timeinterval timecount
#1: 00:00:00 02:59:59         3
#2: 08:00:00 09:59:59         2
#3: 14:00:00 14:59:59         1
#4: 19:00:00 19:59:59         1
#5: 21:00:00 23:59:59         3

From 'res1', we can summarise to get the 'res2'

rbind(res1[c(1, .N)][, .(timeinterval = paste(substr(timeinterval[.N], 
    1, 8), substring(timeinterval[1], 10)), timecount= sum(timecount))], res1[-c(1, .N)])
#        timeinterval timecount
#1: 21:00:00 02:59:59         6
#2: 08:00:00 09:59:59         2
#3: 14:00:00 14:59:59         1
#4: 19:00:00 19:59:59         1

Upvotes: 2

Related Questions