Reputation: 529

Group by time proximity between adjacent rows

I have a dataset like this

      id  data                time moreData
   <int> <int>              <dttm>    <dbl>
 1     1     4 2017-05-12 18:34:20     4450
 2     2     4 2017-05-12 18:37:07     2800
 3     3     4 2017-05-12 18:37:10     1900
 4     4     4 2017-05-12 18:37:59     1950
 5     5     4 2017-05-12 18:38:40     2500

containing timestamps. You could say these rows are "requests to a website", and I want to approximate "sessions".

In other words, I want to put consecutive rows i and i+1 into the same group whenever the time difference between them is less than, say, 1 minute.

Therefore, the data above would be grouped into {1} and {2,3,4,5}.

Please note this is not a duplicate of other questions asking about grouping into predetermined time intervals - I do not care how big the time difference between the first and last element of a group is; I care only about the difference between adjacent rows.

How can I achieve this?

Sample data:

structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), time = structure(c(1494606860, 
1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198, 
1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343, 
1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493, 
1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id", 
"user", "time"), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

Upvotes: 2

Answers (3)

MKR

Reputation: 20095

One possible solution is to use the lag function from the dplyr package and cumsum from base R.

The approach is:

1. Find the time difference between each row and the previous row, in seconds.
2. If difftime is more than 60, then that row starts a new group (newgroup).
3. Perform cumsum on newgroup to get the group number of each row.

The code is:

    #data
    library(dplyr)
df <- structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 
     4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), 
     time = structure(c(1494606860,1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198, 
          1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343, 
           1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493, 
          1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id", 
          "user", "time"), row.names = c(NA, -20L), class = c("tbl_df", 
              "tbl", "data.frame"))



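df %>%
  # Seconds since the previous row (0 for the first row); units are fixed explicitly
  mutate(difftime = coalesce(as.numeric(difftime(time, lag(time), units = "secs")), 0)) %>%
  # A gap of more than 60 seconds marks the start of a new group
  mutate(newgroup = ifelse(difftime > 60, 1, 0)) %>%
  # The running total of those markers is the group number
  mutate(group = factor(cumsum(newgroup))) %>%
  select(id, user, time, group)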

    #Result
      id  user time                group 
   <int> <int> <dttm>              <fctr>
 1     1     4 2017-05-12 17:34:20 0     
 2     2     4 2017-05-12 17:37:07 1     
 3     3     4 2017-05-12 17:37:10 1     
 4     4     4 2017-05-12 17:39:32 2     
 5     5     4 2017-05-12 17:39:33 2     
 6     6     4 2017-05-12 17:39:57 2     
 7     7     4 2017-05-12 17:39:58 2     
 8     8     4 2017-05-12 17:40:00 2     
 9     9     4 2017-05-12 17:41:49 3     
10    10     4 2017-05-12 17:41:52 3     
11    11     4 2017-05-12 17:42:19 3     
12    12     4 2017-05-12 17:42:20 3     
13    13     4 2017-05-12 17:42:23 3     
14    14     4 2017-05-12 17:42:23 3     
15    15     4 2017-05-12 17:43:24 4     
16    16     4 2017-05-12 17:43:25 4     
17    17     4 2017-05-12 17:43:27 4     
18    18     4 2017-05-12 17:44:52 5     
19    19     4 2017-05-12 17:44:53 5     
20    20     4 2017-05-12 17:44:55 5
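
The sample data holds a single user. If the table held several users, the same lag-and-cumsum steps could probably be applied within each user, so that a request from one user never continues another user's session. The following is only a sketch under that assumption: the `requests` table and its values are invented for illustration and are not from the question.

```r
library(dplyr)

# Hypothetical two-user data, invented for illustration (not from the question)
requests <- tibble(
  user = c(4L, 4L, 4L, 7L, 7L),
  time = as.POSIXct("2017-05-12 18:00:00", tz = "UTC") + c(0, 10, 200, 5, 30)
)

sessions <- requests %>%
  arrange(user, time) %>%
  group_by(user) %>%
  mutate(
    # Seconds since the same user's previous request (NA for each user's first row)
    gap = as.numeric(difftime(time, lag(time), units = "secs")),
    # A user's first row never counts as a break; later rows do when the gap exceeds 60
    group = cumsum(!is.na(gap) & gap > 60)
  ) %>%
  ungroup()

sessions
```

Here the group number restarts at 0 for every user, so a session is identified by the pair (user, group) rather than by group alone.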

Upvotes: 1

pogibas

Reputation: 28339

You can use the difftime function from base R.

Code:

# Wanted time difference in minutes
wantedDiff <- 1
timeDiff <- abs(difftime(df$time[-nrow(df)], 
                         df$time[-1], 
                         units = "mins"))
df$group <- cumsum(c(0, as.numeric(timeDiff >= wantedDiff)))

Result:

   id user                time group
1   1    4 2017-05-12 19:34:20     0
2   2    4 2017-05-12 19:37:07     1
3   3    4 2017-05-12 19:37:10     1
4   4    4 2017-05-12 19:39:32     2
5   5    4 2017-05-12 19:39:33     2
6   6    4 2017-05-12 19:39:57     2
7   7    4 2017-05-12 19:39:58     2
8   8    4 2017-05-12 19:40:00     2
9   9    4 2017-05-12 19:41:49     3
10 10    4 2017-05-12 19:41:52     3
11 11    4 2017-05-12 19:42:19     3
12 12    4 2017-05-12 19:42:20     3
13 13    4 2017-05-12 19:42:23     3
14 14    4 2017-05-12 19:42:23     3
15 15    4 2017-05-12 19:43:24     4
16 16    4 2017-05-12 19:43:25     4
17 17    4 2017-05-12 19:43:27     4
18 18    4 2017-05-12 19:44:52     5
19 19    4 2017-05-12 19:44:53     5
20 20    4 2017-05-12 19:44:55     5

Explanation:

Calculate absolute difference in time between current and previous row using difftime
- We can specify difference units here
- Output (timeDiff) looks like this:

Time differences in mins
 [1] 2.78333333 0.05000000 2.36666667 0.01666667 0.40000000 0.01666667 0.03333333 1.81666667 0.05000000 0.45000000
[11] 0.01666667 0.05000000 0.00000000 1.01666667 0.01666667 0.03333333 1.41666667 0.01666667 0.03333333

Test whether each difference is greater than or equal to wantedDiff and convert this logical output to numeric
Apply cumsum to that numeric output (each 1 adds +1, i.e. switches to a new group)
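
To make the last two steps concrete, here is a small sketch that replays them on the first five differences copied from the timeDiff output above. The variable name `newGroup` is introduced only for this illustration.

```r
# First five gaps (in minutes), copied from the timeDiff output above
timeDiff <- c(2.78333333, 0.05, 2.36666667, 0.01666667, 0.4)
wantedDiff <- 1

# Logical -> numeric: a 1 marks a row that starts a new group
newGroup <- as.numeric(timeDiff >= wantedDiff)   # 1 0 1 0 0

# Prepend 0 for the first row (it has no previous row), then accumulate
group <- cumsum(c(0, newGroup))                  # 0 1 1 2 2 2
group
```

These six values match the group column of rows 1 to 6 in the result table.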

Data:

df <- structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), time = structure(c(1494606860, 
1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198, 
1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343, 
1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493, 
1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id", 
"user", "time"), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

Upvotes: 2

Calum You

Reputation: 15072

Here is a solution using an expanded set of your example data. The key parts of this approach are using lubridate::ymd_hms to convert strings into times that you can do arithmetic with, and then lag to figure out whether or not time is within a minute of the previous row. Then, you can use a for loop to create the groups by incrementing the group number every time you get to a row that is not within one minute of the previous row. Can certainly be neatened up a little, and would love to see if anyone can do this without resorting to a for loop and bind_cols!

library(tidyverse)
library(lubridate)  # provides ymd_hms(); attached by tidyverse only from version 2.0.0
tbl <- tibble(
  id = 1:8,
  time = c("2017-05-12 18:34:20",
           "2017-05-12 18:37:07",
           "2017-05-12 18:37:10",
           "2017-05-12 18:37:59",
           "2017-05-12 18:38:40",
           "2017-05-12 18:40:40",
           "2017-05-12 18:40:49",
           "2017-05-12 18:43:40"
           )
)

tbl2 <- tbl %>%
  mutate(time = ymd_hms(time)) %>%
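  # Seconds since the previous row (NA for the first row, which the loop below skips)
  mutate(separation = as.numeric(difftime(time, lag(time), units = "secs"))) %>%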
  mutate(onemin = separation <= 60)

group_ids = 1
for (i in 2:nrow(tbl2)){
  if (tbl2$onemin[i] == FALSE){
    group_ids[i] <- group_ids[i - 1] +1
  } else
  group_ids[i] <- group_ids[i - 1]
}

tbl2 %>%
  bind_cols(., group = group_ids) %>%
  select(id, time, group)

# A tibble: 8 x 3
     id time                group
  <int> <dttm>              <dbl>
1     1 2017-05-12 18:34:20  1.00
2     2 2017-05-12 18:37:07  2.00
3     3 2017-05-12 18:37:10  2.00
4     4 2017-05-12 18:37:59  2.00
5     5 2017-05-12 18:38:40  2.00
6     6 2017-05-12 18:40:40  3.00
7     7 2017-05-12 18:40:49  3.00
8     8 2017-05-12 18:43:40  4.00
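
On the question of avoiding the for loop and bind_cols: one option is to mark every row whose gap from the previous row exceeds one minute and take a running total of those marks. This is a minimal base-R sketch of that idea, not a drop-in replacement for the pipeline above. It assumes the rows are already sorted by time and parses the strings as UTC; the names `times`, `gap_secs` and `group` are chosen here for illustration.

```r
# Same eight timestamps as in the example above, parsed as UTC (an assumption)
times <- as.POSIXct(c("2017-05-12 18:34:20",
                      "2017-05-12 18:37:07",
                      "2017-05-12 18:37:10",
                      "2017-05-12 18:37:59",
                      "2017-05-12 18:38:40",
                      "2017-05-12 18:40:40",
                      "2017-05-12 18:40:49",
                      "2017-05-12 18:43:40"), tz = "UTC")

# Gap between each row and the previous one, in seconds
gap_secs <- as.numeric(diff(times), units = "secs")

# The first row always opens group 1; each later gap over 60 s opens the next group
group <- cumsum(c(TRUE, gap_secs > 60))
group
```

Because `gap_secs > 60` is the negation of the `separation <= 60` test in the loop, this yields the same group numbers as the table above.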

Upvotes: 1
