Kiraa

Reputation: 529

Group by time proximity between adjacent rows

I have a dataset like this:

      id  data                time moreData
   <int> <int>              <dttm>    <dbl>
 1     1     4 2017-05-12 18:34:20     4450
 2     2     4 2017-05-12 18:37:07     2800
 3     3     4 2017-05-12 18:37:10     1900
 4     4     4 2017-05-12 18:37:59     1950
 5     5     4 2017-05-12 18:38:40     2500

containing timestamps. You could say these data are "requests to a website", and I want to approximate "sessions".

In other words, I want rows i and i+1 to end up in the same group whenever the time difference between them is less than, say, 1 minute.

Therefore, the data above would be grouped into {1} and {2,3,4,5}.

Please note this is not a duplicate question of other questions asking about grouping in predetermined time intervals - I do not care how big the time difference between first and last element is, I care only about difference in adjacent rows.

How can I achieve this?

Sample data:

structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), time = structure(c(1494606860, 
1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198, 
1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343, 
1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493, 
1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id", 
"user", "time"), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

Upvotes: 2

Views: 112

Answers (3)

MKR

Reputation: 20095

One possible solution is to use the lag function from the dplyr package and cumsum from base R.

The approach is:

  • Find the time difference between each row and the previous one, in seconds
  • If difftime is more than 60, that row starts a new group (newgroup)
  • Perform cumsum on newgroup to get the group number of each row

The code is:

# Data
library(dplyr)
df <- structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 
     4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), 
     time = structure(c(1494606860,1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198, 
          1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343, 
           1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493, 
          1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id", 
          "user", "time"), row.names = c(NA, -20L), class = c("tbl_df", 
              "tbl", "data.frame"))



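df %>%
  # Seconds since the previous row; units are fixed so they cannot switch to minutes
  mutate(difftime = as.numeric(difftime(time, lag(time), units = "secs")),
         # The first row has no predecessor, so treat its gap as 0
         difftime = ifelse(is.na(difftime), 0, difftime)) %>%
  # Flag rows that start a new group, then count the flags cumulatively
  mutate(newgroup = ifelse(difftime > 60, 1, 0)) %>%
  mutate(group = factor(cumsum(newgroup))) %>%
  select(id, user, time, group)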

# Result
      id  user time                group 
   <int> <int> <dttm>              <fctr>
 1     1     4 2017-05-12 17:34:20 0     
 2     2     4 2017-05-12 17:37:07 1     
 3     3     4 2017-05-12 17:37:10 1     
 4     4     4 2017-05-12 17:39:32 2     
 5     5     4 2017-05-12 17:39:33 2     
 6     6     4 2017-05-12 17:39:57 2     
 7     7     4 2017-05-12 17:39:58 2     
 8     8     4 2017-05-12 17:40:00 2     
 9     9     4 2017-05-12 17:41:49 3     
10    10     4 2017-05-12 17:41:52 3     
11    11     4 2017-05-12 17:42:19 3     
12    12     4 2017-05-12 17:42:20 3     
13    13     4 2017-05-12 17:42:23 3     
14    14     4 2017-05-12 17:42:23 3     
15    15     4 2017-05-12 17:43:24 4     
16    16     4 2017-05-12 17:43:25 4     
17    17     4 2017-05-12 17:43:27 4     
18    18     4 2017-05-12 17:44:52 5     
19    19     4 2017-05-12 17:44:53 5     
20    20     4 2017-05-12 17:44:55 5    

Upvotes: 1

pogibas

Reputation: 28339

You can use the difftime function from base R.

Code:

# Wanted time difference in minutes
wantedDiff <- 1
timeDiff <- abs(difftime(df$time[-nrow(df)], 
                         df$time[-1], 
                         units = "mins"))
df$group <- cumsum(c(0, as.numeric(timeDiff >= wantedDiff)))

Result:

   id user                time group
1   1    4 2017-05-12 19:34:20     0
2   2    4 2017-05-12 19:37:07     1
3   3    4 2017-05-12 19:37:10     1
4   4    4 2017-05-12 19:39:32     2
5   5    4 2017-05-12 19:39:33     2
6   6    4 2017-05-12 19:39:57     2
7   7    4 2017-05-12 19:39:58     2
8   8    4 2017-05-12 19:40:00     2
9   9    4 2017-05-12 19:41:49     3
10 10    4 2017-05-12 19:41:52     3
11 11    4 2017-05-12 19:42:19     3
12 12    4 2017-05-12 19:42:20     3
13 13    4 2017-05-12 19:42:23     3
14 14    4 2017-05-12 19:42:23     3
15 15    4 2017-05-12 19:43:24     4
16 16    4 2017-05-12 19:43:25     4
17 17    4 2017-05-12 19:43:27     4
18 18    4 2017-05-12 19:44:52     5
19 19    4 2017-05-12 19:44:53     5
20 20    4 2017-05-12 19:44:55     5

Explanation:

  • Calculate the absolute time difference between each row and the previous one using difftime
    • We can specify the units of the difference here (units = "mins")
    • The output (timeDiff) looks like this:
Time differences in mins
 [1] 2.78333333 0.05000000 2.36666667 0.01666667 0.40000000 0.01666667 0.03333333 1.81666667 0.05000000 0.45000000
[11] 0.01666667 0.05000000 0.00000000 1.01666667 0.01666667 0.03333333 1.41666667 0.01666667 0.03333333
  • Test whether each difference is greater than or equal to wantedDiff and convert the logical result to numeric
  • Apply cumsum to that numeric vector (each 1 adds +1, i.e. switches to a new group)
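
To see these steps in isolation, here is a minimal sketch on the first five timestamps from the question. The names times, gaps and starts_new are illustrative only and do not appear in the code above, and the time zone is fixed to UTC purely to keep the example reproducible.

```r
# First five timestamps from the question, written as offsets in seconds
# from the first row (UTC is an assumption made for reproducibility)
times <- as.POSIXct(c(0, 167, 170, 219, 260),
                    origin = "2017-05-12 18:34:20", tz = "UTC")

# Gap between each row and the previous one, always expressed in minutes
gaps <- difftime(times[-1], times[-length(times)], units = "mins")

# TRUE marks a row that starts a new group; the first row never does
starts_new <- c(FALSE, as.numeric(gaps) >= 1)

# The running count of TRUE values is the group number
group <- cumsum(starts_new)
group
# [1] 0 1 1 1 1
```

Only the first gap (167 seconds) reaches one minute, so row 1 stays alone in group 0 and rows 2 to 5 share group 1, which matches the {1} and {2,3,4,5} split asked for in the question.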

Data:

df <- structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), time = structure(c(1494606860, 
1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198, 
1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343, 
1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493, 
1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id", 
"user", "time"), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

Upvotes: 2

Calum You

Reputation: 15072

Here is a solution using an expanded set of your example data. The key parts of this approach are using lubridate::ymd_hms to convert the strings into times that you can do arithmetic with, and then lag to work out whether each time is within a minute of the previous row. After that, a for loop creates the groups by incrementing the group number every time it reaches a row that is not within one minute of the previous row. This can certainly be neatened up a little, and I would love to see whether anyone can do this without resorting to a for loop and bind_cols!

library(tidyverse)
library(lubridate)  # provides ymd_hms(); older tidyverse versions do not attach it
tbl <- tibble(
  id = 1:8,
  time = c("2017-05-12 18:34:20",
           "2017-05-12 18:37:07",
           "2017-05-12 18:37:10",
           "2017-05-12 18:37:59",
           "2017-05-12 18:38:40",
           "2017-05-12 18:40:40",
           "2017-05-12 18:40:49",
           "2017-05-12 18:43:40"
           )
)

tbl2 <- tbl %>%
  mutate(time = ymd_hms(time)) %>%
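  # Seconds since the previous row (NA for the first row, which the loop below skips)
  mutate(separation = as.numeric(difftime(time, lag(time), units = "secs"))) %>%
  mutate(onemin = separation <= 60)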

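# The first row always opens group 1
group_ids <- 1
for (i in 2:nrow(tbl2)) {
  if (!tbl2$onemin[i]) {
    # More than a minute since the previous row: start a new group
    group_ids[i] <- group_ids[i - 1] + 1
  } else {
    # Otherwise stay in the previous row's group
    group_ids[i] <- group_ids[i - 1]
  }
}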

tbl2 %>%
  bind_cols(., group = group_ids) %>%
  select(id, time, group)

# A tibble: 8 x 3
     id time                group
  <int> <dttm>              <dbl>
1     1 2017-05-12 18:34:20  1.00
2     2 2017-05-12 18:37:07  2.00
3     3 2017-05-12 18:37:10  2.00
4     4 2017-05-12 18:37:59  2.00
5     5 2017-05-12 18:38:40  2.00
6     6 2017-05-12 18:40:40  3.00
7     7 2017-05-12 18:40:49  3.00
8     8 2017-05-12 18:43:40  4.00
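
As a possible answer to the open question above, the loop and bind_cols can probably be replaced by a cumsum over a logical "starts a new group" flag. This is only a sketch under a few assumptions: the column name new_group is made up for illustration, the times are parsed with base as.POSIXct in UTC instead of ymd_hms, and the gap is forced to seconds with difftime.

```r
library(dplyr)

tbl <- tibble(
  id = 1:8,
  time = as.POSIXct(c("2017-05-12 18:34:20", "2017-05-12 18:37:07",
                      "2017-05-12 18:37:10", "2017-05-12 18:37:59",
                      "2017-05-12 18:38:40", "2017-05-12 18:40:40",
                      "2017-05-12 18:40:49", "2017-05-12 18:43:40"),
                    tz = "UTC")
)

grouped <- tbl %>%
  mutate(
    # Seconds since the previous row; NA for the first row
    separation = as.numeric(difftime(time, lag(time), units = "secs")),
    # A row opens a new group if it is the first row or the gap exceeds 60 s
    new_group = is.na(separation) | separation > 60,
    # Running count of group starts gives the group number
    group = cumsum(new_group)
  ) %>%
  select(id, time, group)

grouped$group
# [1] 1 2 2 2 2 3 3 4
```

The group numbers agree with the loop-based result above; the only visible difference is that cumsum on a logical vector returns integers rather than doubles.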

Upvotes: 1
