I have a dataset like this:
id data time moreData
<int> <int> <dttm> <dbl>
1 1 4 2017-05-12 18:34:20 4450
2 2 4 2017-05-12 18:37:07 2800
3 3 4 2017-05-12 18:37:10 1900
4 4 4 2017-05-12 18:37:59 1950
5 5 4 2017-05-12 18:38:40 2500
It contains timestamps. You could say these rows are "requests to a website", and I want to approximate "sessions".
In other words, I want to put consecutive rows in the same group whenever the time difference between row i and row i+1 is below some threshold, say 1 minute.
Therefore, the data above would be grouped into {1} and {2,3,4,5}.
Please note this is not a duplicate of other questions about grouping into predetermined time intervals. I do not care how big the time difference between the first and last element of a group is; I care only about the difference between adjacent rows.
How can I achieve this?
Sample data:
structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), time = structure(c(1494606860,
1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198,
1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343,
1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493,
1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id",
"user", "time"), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Upvotes: 2
One possible solution is to use the lag function from the dplyr package and cumsum from base R.
The approach is:
1. Compute difftime, the time difference in seconds between each row and the previous one. If it is more than 60, that row starts a new group (newgroup).
2. Take cumsum on newgroup to get the group number of each row.
The code is:
#data
library(dplyr)
df <- structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L),
time = structure(c(1494606860,1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198,
1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343,
1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493,
1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id",
"user", "time"), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
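df %>%
  # Seconds since the previous row; the first row has no predecessor, so use 0
  mutate(difftime = as.numeric(difftime(time, lag(time), units = "secs")),
         difftime = coalesce(difftime, 0)) %>%
  # Flag rows that come more than 60 seconds after the previous one
  mutate(newgroup = ifelse(difftime > 60, 1, 0)) %>%
  # The running total of the flags is the group number
  mutate(group = factor(cumsum(newgroup))) %>%
  select(id, user, time, group)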
#Result
id user time group
<int> <int> <dttm> <fctr>
1 1 4 2017-05-12 17:34:20 0
2 2 4 2017-05-12 17:37:07 1
3 3 4 2017-05-12 17:37:10 1
4 4 4 2017-05-12 17:39:32 2
5 5 4 2017-05-12 17:39:33 2
6 6 4 2017-05-12 17:39:57 2
7 7 4 2017-05-12 17:39:58 2
8 8 4 2017-05-12 17:40:00 2
9 9 4 2017-05-12 17:41:49 3
10 10 4 2017-05-12 17:41:52 3
11 11 4 2017-05-12 17:42:19 3
12 12 4 2017-05-12 17:42:20 3
13 13 4 2017-05-12 17:42:23 3
14 14 4 2017-05-12 17:42:23 3
15 15 4 2017-05-12 17:43:24 4
16 16 4 2017-05-12 17:43:25 4
17 17 4 2017-05-12 17:43:27 4
18 18 4 2017-05-12 17:44:52 5
19 19 4 2017-05-12 17:44:53 5
20 20 4 2017-05-12 17:44:55 5
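The sample data contains a single user. If the real log mixes several users, the same lag/cumsum idea could probably be applied per user with group_by. The sketch below is only an illustration under that assumption: the two-user data frame, user 7, and the column names gap, new_session and session are made up for the example and are not part of the question.

```r
library(dplyr)

# Hypothetical two-user log; user 7 does not appear in the original sample
requests <- data.frame(
  user = c(4L, 4L, 4L, 7L, 7L),
  time = as.POSIXct(
    c("2017-05-12 18:34:20", "2017-05-12 18:37:07", "2017-05-12 18:37:10",
      "2017-05-12 18:37:08", "2017-05-12 18:40:00"),
    tz = "UTC"
  )
)

sessions <- requests %>%
  arrange(user, time) %>%
  group_by(user) %>%
  mutate(
    # Seconds since the same user's previous request; NA for their first one
    gap = as.numeric(difftime(time, lag(time), units = "secs")),
    # A request opens a new session when the gap exceeds 60 seconds
    new_session = !is.na(gap) & gap > 60,
    # Running count of session starts, numbered from 1 within each user
    session = cumsum(new_session) + 1L
  ) %>%
  ungroup()

sessions$session
```

Because lag is evaluated inside each group, the first request of every user gets an NA gap and never continues another user's session.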
Upvotes: 1
You can use the difftime function from base R.
Code:
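# Wanted time difference in minutes
wantedDiff <- 1
# Gaps between adjacent rows: rows 1..(n-1) versus rows 2..n
timeDiff <- abs(difftime(df$time[-nrow(df)],
                         df$time[-1],
                         units = "mins"))
# A gap of at least wantedDiff starts a new group; the first row is in group 0
df$group <- cumsum(c(0, as.numeric(timeDiff >= wantedDiff)))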
Result:
id user time group
1 1 4 2017-05-12 19:34:20 0
2 2 4 2017-05-12 19:37:07 1
3 3 4 2017-05-12 19:37:10 1
4 4 4 2017-05-12 19:39:32 2
5 5 4 2017-05-12 19:39:33 2
6 6 4 2017-05-12 19:39:57 2
7 7 4 2017-05-12 19:39:58 2
8 8 4 2017-05-12 19:40:00 2
9 9 4 2017-05-12 19:41:49 3
10 10 4 2017-05-12 19:41:52 3
11 11 4 2017-05-12 19:42:19 3
12 12 4 2017-05-12 19:42:20 3
13 13 4 2017-05-12 19:42:23 3
14 14 4 2017-05-12 19:42:23 3
15 15 4 2017-05-12 19:43:24 4
16 16 4 2017-05-12 19:43:25 4
17 17 4 2017-05-12 19:43:27 4
18 18 4 2017-05-12 19:44:52 5
19 19 4 2017-05-12 19:44:53 5
20 20 4 2017-05-12 19:44:55 5
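For comparison, the same grouping can also be sketched with diff() on the numeric timestamps. Converting to numeric first means the gaps are always in seconds, so no units argument is needed. This is a variant, not the code from this answer: the names secs, times, max_gap_secs, gaps and group are chosen for the example, and the time zone is assumed to be UTC.

```r
# Timestamps in seconds since the epoch, copied from the sample data
secs <- c(1494606860, 1494607027, 1494607030, 1494607172, 1494607173,
          1494607197, 1494607198, 1494607200, 1494607309, 1494607312,
          1494607339, 1494607340, 1494607343, 1494607343, 1494607404,
          1494607405, 1494607407, 1494607492, 1494607493, 1494607495)
times <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

max_gap_secs <- 60  # hypothetical name for the one-minute threshold

# diff() on the numeric timestamps gives the gap to the previous row in seconds
gaps <- diff(as.numeric(times))

# TRUE marks a row that starts a new group; the first row never does
group <- cumsum(c(FALSE, gaps >= max_gap_secs))

table(group)
```

The group sizes from table(group) should match the result shown above (groups 0 to 5). The explanation that follows refers to the difftime version.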
Explanation:
1. Calculate the time difference between adjacent rows with difftime, in the chosen units (minutes here). The result (timeDiff) looks like this:
Time differences in mins
 [1] 2.78333333 0.05000000 2.36666667 0.01666667 0.40000000 0.01666667 0.03333333 1.81666667 0.05000000 0.45000000
[11] 0.01666667 0.05000000 0.00000000 1.01666667 0.01666667 0.03333333 1.41666667 0.01666667 0.03333333
2. Compare each difference with wantedDiff and transform this logical output to numeric.
3. Apply cumsum to that numeric output (each 1 adds +1, i.e. switches to a new group).
Data:
df <- structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), time = structure(c(1494606860,
1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198,
1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343,
1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493,
1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id",
"user", "time"), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Upvotes: 2
Here is a solution using an expanded set of your example data. The key parts of this approach are using lubridate::ymd_hms to convert strings into times that you can do arithmetic with, and then lag to figure out whether or not each time is within a minute of the previous row. Then, you can use a for loop to create the groups by incrementing the group number every time you get to a row that is not within one minute of the previous row. This can certainly be neatened up a little, and I would love to see if anyone can do this without resorting to a for loop and bind_cols!
library(tidyverse)
library(lubridate)  # provides ymd_hms; attached by tidyverse itself only from 2.0.0
tbl <- tibble(
id = 1:8,
time = c("2017-05-12 18:34:20",
"2017-05-12 18:37:07",
"2017-05-12 18:37:10",
"2017-05-12 18:37:59",
"2017-05-12 18:38:40",
"2017-05-12 18:40:40",
"2017-05-12 18:40:49",
"2017-05-12 18:43:40"
)
)
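tbl2 <- tbl %>%
  mutate(time = ymd_hms(time)) %>%
  # Seconds since the previous row; NA for the first row, which has none
  mutate(separation = as.numeric(difftime(time, lag(time), units = "secs"))) %>%
  mutate(onemin = separation <= 60)
# Walk the rows, starting a new group whenever the gap exceeds one minute
group_ids <- 1
for (i in 2:nrow(tbl2)) {
  if (!tbl2$onemin[i]) {
    group_ids[i] <- group_ids[i - 1] + 1
  } else {
    group_ids[i] <- group_ids[i - 1]
  }
}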
tbl2 %>%
bind_cols(., group = group_ids) %>%
select(id, time, group)
# A tibble: 8 x 3
id time group
<int> <dttm> <dbl>
1 1 2017-05-12 18:34:20 1.00
2 2 2017-05-12 18:37:07 2.00
3 3 2017-05-12 18:37:10 2.00
4 4 2017-05-12 18:37:59 2.00
5 5 2017-05-12 18:38:40 2.00
6 6 2017-05-12 18:40:40 3.00
7 7 2017-05-12 18:40:49 3.00
8 8 2017-05-12 18:43:40 4.00
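On the question of avoiding the loop: the group number is just a running count of the rows that start a new group, so cumsum can likely stand in for both the for loop and bind_cols. A minimal sketch under that assumption, using the same eight timestamps (time zone assumed to be UTC; the column name starts_group is made up for the example):

```r
library(dplyr)

tbl <- data.frame(
  id = 1:8,
  time = as.POSIXct(
    c("2017-05-12 18:34:20", "2017-05-12 18:37:07", "2017-05-12 18:37:10",
      "2017-05-12 18:37:59", "2017-05-12 18:38:40", "2017-05-12 18:40:40",
      "2017-05-12 18:40:49", "2017-05-12 18:43:40"),
    tz = "UTC"
  )
)

grouped <- tbl %>%
  mutate(
    # Seconds since the previous row; NA for the first row
    separation = as.numeric(difftime(time, lag(time), units = "secs")),
    # coalesce() treats the first row (no predecessor) as a group start
    starts_group = coalesce(separation > 60, TRUE),
    # Running count of group starts, so numbering begins at 1
    group = cumsum(starts_group)
  ) %>%
  select(id, time, group)

grouped$group
```

Counting the first row as a group start is what makes the numbering begin at 1, as in the loop version; the group column comes out as integer rather than double.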
Upvotes: 1