Reputation: 7526
With the following table dat, my objective is to group by user_id and mobile_id, but only within continuous runs of rows where difftime > -600. The rows of a run must be consecutive in created_at, and each row should get a rank within its run. Each separate run would be assigned an incremental group value. For example:
> dat
created_at user_id mobile_id status difftime
1 2019-01-02 22:01:38 1227604 68409 finished \\N
2 2019-01-03 04:08:29 1227604 68409 finished -366
3 2019-01-03 15:16:38 1227604 68409 timeout -668
4 2019-01-04 00:34:40 1227604 68409 failed -558
5 2019-01-04 00:27:37 1227605 68453 failed \\N
6 2019-01-04 00:35:56 1227605 68453 finished -8
7 2019-01-04 01:39:52 1227605 68453 finished -63
8 2019-01-04 02:05:53 1227605 68453 timeout -26
9 2019-01-04 02:17:17 1227605 68453 timeout -11
10 2019-01-04 16:51:39 1227605 68453 timeout -874
This would produce the following output:
> output
created_at user_id mobile_id status difftime group rank
1 2019-01-02 22:01:38 1227604 68409 finished \\N NA NA
2 2019-01-03 04:08:29 1227604 68409 finished -366 1 1
3 2019-01-03 15:16:38 1227604 68409 timeout -668 NA NA
4 2019-01-04 00:34:40 1227604 68409 failed -558 2 1
5 2019-01-04 00:27:37 1227605 68453 failed \\N NA NA
6 2019-01-04 00:35:56 1227605 68453 finished -8 3 1
7 2019-01-04 01:39:52 1227605 68453 finished -63 3 2
8 2019-01-04 02:05:53 1227605 68453 timeout -26 3 3
9 2019-01-04 02:17:17 1227605 68453 timeout -11 3 4
10 2019-01-04 16:51:39 1227605 68453 timeout -874 NA NA
I am not sure where to begin beyond a simple grouping in dplyr. How would one go about assigning the group and rank columns? My attempt so far:
dat %>%
  group_by(user_id, mobile_id) %>%
  arrange(created_at) %>%
  filter(difftime > -600)
The data:
> dput(dat)
structure(list(created_at = structure(c(1546466498.138, 1546488509.218,
1546528598.628, 1546562080.81, 1546561657.567, 1546562156.632,
1546565992.788, 1546567553.811, 1546568237.325, 1546620699.964
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), user_id = c(1227604,
1227604, 1227604, 1227604, 1227605, 1227605, 1227605, 1227605,
1227605, 1227605), mobile_id = c(68409L, 68409L, 68409L, 68409L,
68453L, 68453L, 68453L, 68453L, 68453L, 68453L), status = c("finished",
"finished", "timeout", "failed", "failed", "finished", "finished",
"timeout", "timeout", "timeout"), difftime = c(NA, -366, -668,
-558, NA, -8, -63, -26, -11, -874), group = c(NA, 1, NA, 2, NA,
3, 3, 3, 3, NA), rank = c(NA, 1, NA, 1, NA, 1, 2, 3, 4, NA)), row.names = c(NA,
-10L), class = "data.frame")
Upvotes: 0
Views: 69
Reputation: 51
You can use cumsum
to define a variable that increases whenever an observation breaks the run (difftime < -600) within the same user_id/mobile_id group. By grouping on this new variable, too, it is easy to create the rank indices:
library("dplyr")
library("tidyr") ## for replace_na
dat2 <- dat %>%
  group_by(user_id, mobile_id) %>%
  arrange(created_at, .by_group = TRUE) %>%                ## grouped arrange
  mutate(d = cumsum(replace_na(difftime < -600, FALSE))) %>%
  group_by(user_id, mobile_id, d) %>%
  mutate(rank = row_number() - 1)                          ## rank id; 0 marks the first row of each run
Then the easiest way to create group indices is with dplyr::group_indices
:
dat2$group <- group_indices(dat2 %>% ungroup, user_id, mobile_id, d)
I'm not sure why you would want to set the first instance of each run to NA
, but you can do it based on the value of rank
:
> mutate(dat2, group = ifelse(rank == 0, NA, group),
+ rank = ifelse(rank == 0, NA, rank))
# A tibble: 10 x 8
# Groups: user_id, mobile_id, d [4]
created_at user_id mobile_id status difftime group rank d
<dttm> <dbl> <int> <chr> <dbl> <int> <dbl> <dbl>
1 2019-01-02 22:01:38 1227604. 68409 finished NA NA NA 0.
2 2019-01-03 04:08:29 1227604. 68409 finished -366. 1 1. 0.
3 2019-01-03 15:16:38 1227604. 68409 timeout -668. NA NA 1.
4 2019-01-04 00:34:40 1227604. 68409 failed -558. 2 1. 1.
5 2019-01-04 00:27:37 1227605. 68453 failed NA NA NA 0.
6 2019-01-04 00:35:56 1227605. 68453 finished -8. 3 1. 0.
7 2019-01-04 01:39:52 1227605. 68453 finished -63. 3 2. 0.
8 2019-01-04 02:05:53 1227605. 68453 timeout -26. 3 3. 0.
9 2019-01-04 02:17:17 1227605. 68453 timeout -11. 3 4. 0.
10 2019-01-04 16:51:39 1227605. 68453 timeout -874. NA NA 1.
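As a side note: in dplyr 1.0 and later, calling group_indices() on a data frame is deprecated. Assuming a recent dplyr, the same group column can be built with cur_group_id() inside the grouped mutate, so everything fits in one chain. This is a sketch of the same logic, not tested across dplyr versions:

```r
library(dplyr)
library(tidyr)  ## for replace_na

dat2 <- dat %>%
  group_by(user_id, mobile_id) %>%
  arrange(created_at, .by_group = TRUE) %>%
  ## d increments whenever a row breaks the run (difftime < -600);
  ## NA rows (the first per user/mobile) are treated as non-breaking
  mutate(d = cumsum(replace_na(difftime < -600, FALSE))) %>%
  group_by(user_id, mobile_id, d) %>%
  mutate(rank  = row_number() - 1,
         group = cur_group_id()) %>%   ## per-group index, replaces group_indices()
  ungroup() %>%
  ## blank out the first row of each run, as in the desired output
  mutate(group = ifelse(rank == 0, NA, group),
         rank  = ifelse(rank == 0, NA, rank))
```

cur_group_id() numbers the (user_id, mobile_id, d) groups in the same order as group_indices() did, so the result should match the table above.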
Upvotes: 1