Reputation: 1869
I have found variations of this question, and I know modulos can possibly be used but I am having a difficult time putting it all together.
I have a sequence of observations by ID and seconds. When the cumulitive amount of seconds by id increments greater than 5 seconds, I would like to restart the count. Could someone help me answer this question in dplyr?
Original df
df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
val = c(2,10,12,15,17,2,4,7,8,12,15,20,25))
df
id val
1 1 2
2 1 10
3 1 12
4 1 15
5 1 17
6 2 2
7 2 4
8 2 7
9 2 8
10 3 12
11 3 15
12 3 20
13 3 25
Desired Outcome
finalResult
id val reset
1 1 2 1
2 1 10 2
3 1 12 2
4 1 15 3
5 1 17 3
6 2 2 1
7 2 4 1
8 2 7 2
9 2 8 2
10 3 12 1
11 3 15 1
12 3 20 2
13 3 25 3
Edit
Thanks for the responses yesterday but I encountered some problems with the given solutions.
On this data set the code works on some instances.
sub.df <- structure(list(`ID` = c("1",
"1", "1",
"1", "1",
"1", "1",
"1", "1"
), dateFormat = structure(c(1479955726, 1479955726, 1483703713,
1495190809, 1495190809, 1497265079, 1497265079, 1474023059, 1474023061
), class = c("POSIXct", "POSIXt"), tzone = "America/Chicago")), .Names = c("ID",
"dateFormat"), row.names = c(NA, -9L), class = c("tbl_df", "tbl",
"data.frame"))
Solution Used:
jj <- sub.df %>%
group_by(`ID`) %>%
arrange(`ID`,`dateFormat`)%>%
mutate(totalTimeInt = difftime(dateFormat,first(dateFormat),units = 'secs'))%>%
mutate(totalTimeFormat = as.numeric(totalTimeInt))%>%
mutate(reset = cumsum(
Reduce(
function(x, y)
if (x + y >= 5) 0
else x + y,
diff(totalTimeFormat), init = 0, accumulate = TRUE
) == 0
))%>%
mutate(reset_2 = cumsum(
accumulate(
diff(totalTimeFormat),
~if (.x + .y >= 5) 0 else .x + .y,
.init = 0
) == 0
))
Outcome
# A tibble: 9 x 6
# Groups: ID [1]
ID dateFormat totalTimeInt totalTimeFormat reset reset_2
<chr> <dttm> <time> <dbl> <int> <int>
1 1 2016-09-16 05:50:59 0 secs 0 1 1
2 1 2016-09-16 05:51:01 2 secs 2 1 1
3 1 2016-11-23 20:48:46 5932667 secs 5932667 2 2
4 1 2016-11-23 20:48:46 5932667 secs 5932667 3 3
5 1 2017-01-06 05:55:13 9680654 secs 9680654 4 4
6 1 2017-05-19 05:46:49 21167750 secs 21167750 5 5
7 1 2017-05-19 05:46:49 21167750 secs 21167750 6 6
8 1 2017-06-12 05:57:59 23242020 secs 23242020 7 7
9 1 2017-06-12 05:57:59 23242020 secs 23242020 8 8
What happens is that for the first two observation it correctly counts that as 1 instance. When it reaches the third and fourth observation this should only be counted as two observations as there was essentially no time that passed between those two instances.
Correct Output:
# A tibble: 9 x 6
# Groups: ID [1]
ID dateFormat totalTimeInt totalTimeFormat reset reset_2
<chr> <dttm> <time> <dbl> <int> <int>
1 1 2016-09-16 05:50:59 0 secs 0 1 1
2 1 2016-09-16 05:51:01 2 secs 2 1 1
3 1 2016-11-23 20:48:46 5932667 secs 5932667 2 2
4 1 2016-11-23 20:48:46 5932667 secs 5932667 2 2
5 1 2017-01-06 05:55:13 9680654 secs 9680654 3 3
6 1 2017-05-19 05:46:49 21167750 secs 21167750 4 4
7 1 2017-05-19 05:46:49 21167750 secs 21167750 4 4
8 1 2017-06-12 05:57:59 23242020 secs 23242020 5 5
9 1 2017-06-12 05:57:59 23242020 secs 23242020 5 5
Upvotes: 5
Views: 1373
Reputation: 43334
If you use Reduce
with accumulate = TRUE
(or purrr::accumulate
, if you prefer), you can reset the running difference when it is greater than or equal to 5. Calling cumsum
on whether that total is 0 will return the number of resets.
library(tidyverse)
df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
val = c(2,10,12,15,17,2,4,7,8,12,15,20,25))
df %>%
group_by(id) %>%
mutate(reset = cumsum(
Reduce(
function(x, y) if (x + y >= 5) 0 else x + y,
diff(val), init = 0, accumulate = TRUE
) == 0
))
#> # A tibble: 13 x 3
#> # Groups: id [3]
#> id val reset
#> <dbl> <dbl> <int>
#> 1 1 2 1
#> 2 1 10 2
#> 3 1 12 2
#> 4 1 15 3
#> 5 1 17 3
#> 6 2 2 1
#> 7 2 4 1
#> 8 2 7 2
#> 9 2 8 2
#> 10 3 12 1
#> 11 3 15 1
#> 12 3 20 2
#> 13 3 25 3
or with purrr::accumulate
,
df %>%
group_by(id) %>%
mutate(reset = cumsum(
accumulate(
diff(val),
~if (.x + .y >= 5) 0 else .x + .y,
.init = 0
) == 0
))
#> # A tibble: 13 x 3
#> # Groups: id [3]
#> id val reset
#> <dbl> <dbl> <int>
#> 1 1 2 1
#> 2 1 10 2
#> 3 1 12 2
#> 4 1 15 3
#> 5 1 17 3
#> 6 2 2 1
#> 7 2 4 1
#> 8 2 7 2
#> 9 2 8 2
#> 10 3 12 1
#> 11 3 15 1
#> 12 3 20 2
#> 13 3 25 3
Regarding the edit, the issue is that some of the diffs are 0, which is the same as what it's counting to see resets. The simplest solution is to use NA
instead of zero as a reset value:
library(tidyverse)
sub.df <- structure(list(`ID` = c("1", "1", "1", "1", "1", "1", "1", "1", "1"),
dateFormat = structure(c(1479955726, 1479955726, 1483703713,
1495190809, 1495190809, 1497265079, 1497265079, 1474023059, 1474023061),
class = c("POSIXct", "POSIXt"), tzone = "America/Chicago")),
.Names = c("ID", "dateFormat"), row.names = c(NA, -9L),
class = c("tbl_df", "tbl", "data.frame"))
sub.df %>%
group_by(ID) %>%
arrange(ID, dateFormat) %>%
mutate(reset = cumsum(is.na(
accumulate(diff(dateFormat),
~{
s <- sum(.x, .y, na.rm = TRUE);
if (s >= 5) NA else s
},
.init = NA)
)))
#> # A tibble: 9 x 3
#> # Groups: ID [1]
#> ID dateFormat reset
#> <chr> <dttm> <int>
#> 1 1 2016-09-16 05:50:59 1
#> 2 1 2016-09-16 05:51:01 1
#> 3 1 2016-11-23 20:48:46 2
#> 4 1 2016-11-23 20:48:46 2
#> 5 1 2017-01-06 05:55:13 3
#> 6 1 2017-05-19 05:46:49 4
#> 7 1 2017-05-19 05:46:49 4
#> 8 1 2017-06-12 05:57:59 5
#> 9 1 2017-06-12 05:57:59 5
Ultimately this approach faces limitations, too, though, as if any values actually are NA
, it will increment similarly. A more robust solution would be to return a list of two elements from each iteration, one for the total with resets, and one for the reset count. This is more work to implement, though:
sub.df %>%
group_by(ID) %>%
arrange(ID, dateFormat) %>%
mutate(total_reset = accumulate(
transpose(list(total = diff(dateFormat), reset = rep(0, n() - 1))),
~{
s <- .x$total + .y$total;
if (s >= 5) {
data_frame(total = 0, reset = .x$reset + 1)
} else {
data_frame(total = s, reset = .x$reset)
}
},
.init = data_frame(total = 0, reset = 1)
)) %>%
unnest()
#> # A tibble: 9 x 4
#> # Groups: ID [1]
#> ID dateFormat total reset
#> <chr> <dttm> <dbl> <dbl>
#> 1 1 2016-09-16 05:50:59 0 1
#> 2 1 2016-09-16 05:51:01 2 1
#> 3 1 2016-11-23 20:48:46 0 2
#> 4 1 2016-11-23 20:48:46 0 2
#> 5 1 2017-01-06 05:55:13 0 3
#> 6 1 2017-05-19 05:46:49 0 4
#> 7 1 2017-05-19 05:46:49 0 4
#> 8 1 2017-06-12 05:57:59 0 5
#> 9 1 2017-06-12 05:57:59 0 5
The total looks a little silly, but if you look at the diff, it's actually correct.
Upvotes: 4
Reputation: 16920
I might be wrong (EDIT: I was proven wrong, by alistaire's brilliant answer, though I'm leaving this approach here for now), but I think this is one of the instances where you actually need a loop because the value of reset
in each row is going to depend on what happened for previous rows. I'm hopeful Joseph Wood will come up with something smarter than this, but in the meantime here is a naive approach, which utilizes dplyr
as requested. We can make the following function
count_resets <- function(x) {
N <- length(x)
value <- 1
result <- rep(1, N)
threshold <- x[1]
for ( i in 2:N ) {
if ( abs(x[i] - threshold) >= 5) {
value <- value + 1
threshold <- x[i]
}
result[i] <- value
}
return(result)
}
And apply it by id
using dplyr
's group_by()
:
library(dplyr)
df %>%
group_by(id) %>%
mutate(reset = count_resets(val))
# A tibble: 13 x 3
# Groups: id [3]
id val reset
<dbl> <dbl> <dbl>
1 1 2 1
2 1 10 2
3 1 12 2
4 1 15 3
5 1 17 3
6 2 2 1
7 2 4 1
8 2 7 2
9 2 8 2
10 3 12 1
11 3 15 1
12 3 20 2
13 3 25 3
Upvotes: 2