Reputation: 10199
I would like to add 1 to session
column when the difference between the first timestamp
of the same session and followings timestamp
records is more than 10 units.
in other words:
if the gap in timestamp
column is more than 10 in the same session, add 1 to the rest of the sessions for a specific ID. So we shouldn't have the same session
with the gap more than 10 in its records.
lets say:
df<-read.table(text="
ID timestamp session
1 10 1
1 12 1
1 15 1
1 21 1
1 25 1
1 27 2
1 29 2
2 11 1
2 22 2
2 27 2
2 32 2
2 42 2
2 43 3",header=T,stringsAsFactors = F)
In the example above, for ID==1
the session gap from the first record (timestamp==10) is more than 10 in row 4 (timestamp==21), so we add 1 to the rest of sessions. Whenever the session number change the difference of the first record of timestamp should be less than 10 in the same sassion, otherwise it should add to session.
result:
ID timestamp session
1 *10 1
1 12 1
1 15 1
1 *21 2 <-- because 21-10 >= 10 it add 1 to the rest of sessions in this ID
1 25 2
1 27 3
1 29 3
2 11 1
2 *22 2
2 27 2
2 *32 3 <-- because 32-22>= 10 it add 1 to the rest of session
2 *42 4 <-- because 42-32>=10
2 43 5
How can I do it in R?
Upvotes: 3
Views: 120
Reputation: 30474
Perhaps a custom function might help that calculates a cumulative sum and resets once threshold reached. In this case, if you provide the function with the session
data, it will provide a result that will include a cumulative "offset" for session, but only in rows when the session number did not increase. This addresses the case of ID
2 timestamp
22 where the difference > 10, but session number increased from 1 to 2.
library(tidyverse)
threshold <- 10
cumsum_with_reset <- function(x, session, threshold) {
cumsum <- 0
group <- 0
result <- numeric()
for (i in seq_along(x)) {
cumsum <- cumsum + x[i]
if (cumsum >= threshold) {
if (session[i] == session[i-1]) {
group <- group + 1
}
cumsum <- 0
}
result = c(result, group)
}
return (result)
}
df %>%
group_by(ID) %>%
mutate(diff = c(0, diff(timestamp)),
cumdiff = cumsum_with_reset(diff, session, threshold),
new_session = cumdiff + session)
Function adapted from this solution.
Output
ID timestamp session diff cumdiff new_session
<int> <int> <int> <dbl> <dbl> <dbl>
1 1 10 1 0 0 1
2 1 12 1 2 0 1
3 1 15 1 3 0 1
4 1 21 1 6 1 2
5 1 25 1 4 1 2
6 1 27 2 2 1 3
7 1 29 2 2 1 3
8 2 11 1 0 0 1
9 2 22 2 11 0 2
10 2 27 2 5 0 2
11 2 32 2 5 1 3
12 2 42 2 10 2 4
13 2 43 3 1 2 5
Upvotes: 1