Reputation: 375
I want to group my dataset based on a posxict column. I need to group where the difference in time between the observations is no more than 3 hours compared to the previous observation. So this is a sample of my data:
time type day month gas response
<dttm> <chr> <int> <dbl> <chr> <dbl>
1 2018-08-21 14:30:00 old_std 21 8 benzene_area 10260721
2 2018-08-21 15:30:00 old_std 21 8 benzene_area 2591905
3 2018-11-09 20:00:00 old_std 9 11 benzene_area 684166
4 2018-11-09 23:00:00 old_std 9 11 benzene_area 589498
5 2018-11-10 02:00:00 old_std 10 11 benzene_area 460714
6 2018-11-10 05:00:00 old_std 10 11 benzene_area 490663
7 2018-11-10 11:30:00 old_std 10 11 benzene_area 404947
8 2018-11-10 14:30:00 old_std 10 11 benzene_area 402566
9 2018-11-10 16:30:00 old_std 10 11 benzene_area 362376
10 2018-11-11 00:00:00 old_std 11 11 benzene_area 276184
And this is what I am looking for:
time type day month gas response group
<dttm> <chr> <int> <dbl> <chr> <dbl>
1 2018-08-21 14:30:00 old_std 21 8 benzene_area 10260721 1
2 2018-08-21 15:30:00 old_std 21 8 benzene_area 2591905 1
3 2018-11-09 20:00:00 old_std 9 11 benzene_area 684166 2
4 2018-11-09 23:00:00 old_std 9 11 benzene_area 589498 2
5 2018-11-10 02:00:00 old_std 10 11 benzene_area 460714 2
6 2018-11-10 05:00:00 old_std 10 11 benzene_area 490663 2
7 2018-11-10 11:30:00 old_std 10 11 benzene_area 404947 3
8 2018-11-10 14:30:00 old_std 10 11 benzene_area 402566 3
9 2018-11-10 16:30:00 old_std 10 11 benzene_area 362376 3
10 2018-11-11 00:00:00 old_std 11 11 benzene_area 276184 4
I hadn't found a way to group the data like this. Any ideas?
Upvotes: 0
Views: 177
Reputation: 389235
You can use lag
to get previous value and use it in difftime
to get difference in hours and increment the group value everytime the difference is greater than 3 hours.
library(dplyr)
df %>%
mutate(group = cumsum(difftime(time, lag(time, default = first(time)),
units = "hours") > 3) + 1)
# time type day month gas response group
#1 2018-08-21 14:30:00 old_std 21 8 benzene_area 10260721 1
#2 2018-08-21 15:30:00 old_std 21 8 benzene_area 2591905 1
#3 2018-11-09 20:00:00 old_std 9 11 benzene_area 684166 2
#4 2018-11-09 23:00:00 old_std 9 11 benzene_area 589498 2
#5 2018-11-10 02:00:00 old_std 10 11 benzene_area 460714 2
#6 2018-11-10 05:00:00 old_std 10 11 benzene_area 490663 2
#7 2018-11-10 11:30:00 old_std 10 11 benzene_area 404947 3
#8 2018-11-10 14:30:00 old_std 10 11 benzene_area 402566 3
#9 2018-11-10 16:30:00 old_std 10 11 benzene_area 362376 3
#10 2018-11-11 00:00:00 old_std 11 11 benzene_area 276184 4
In base R, that would be :
df$group <- cumsum(c(TRUE, difftime(df$time[-1], df$time[-nrow(df)],
units = "hours") > 3))
Upvotes: 3