sachinv

Reputation: 502

Mutate Next unique values in one hour and expand and aggregate

I am trying to achieve a sliding window aggregation. I tried something using tidyr functions, but I am sure there are much better / faster ways to achieve this.

Let me explain what I want to achieve:

I have an input dataframe dat:

library(tidyverse)

dat <- tibble(timestamp = seq.POSIXt(as.POSIXct("2019-01-01 00:00:00"), as.POSIXct("2019-01-01 02:00:00"), by = "15 min"))
set.seed(42)
dat$value <- sample(1:5, nrow(dat), replace = T)
dat
# A tibble: 9 x 2
  timestamp           value
  <dttm>              <int>
1 2019-01-01 00:00:00     5
2 2019-01-01 00:15:00     5
3 2019-01-01 00:30:00     2
4 2019-01-01 00:45:00     5
5 2019-01-01 01:00:00     4
6 2019-01-01 01:15:00     3
7 2019-01-01 01:30:00     4
8 2019-01-01 01:45:00     1
9 2019-01-01 02:00:00     4

For every row, I want to find the list of unique values from the value field (ignoring the row's own value if present) that appeared in the next 60 minutes. Let's call that list nextvalue. Then expand each row to generate pairs between value and nextvalue. Then group by value and nextvalue, summarise the counts, and sort in descending order.

I read the docs and have put together the code below.

t <- dat$timestamp
value <- dat$value

# collapse the values of rows (start+1) .. end into a single "|"-separated string
getCI <- function(start, end) {
  paste(value[(start+1):end], collapse = "|")
}

# reuse the first few LETTERS as temporary column names for separate() below
LETTERS <- LETTERS[1:(length(unique(value)) - 1)]

dat %>%
  mutate(time_next = timestamp + 60*60) %>%            # end of each row's 60-minute window
  rowwise() %>%
  mutate(flag = max(which(time_next >= t))) %>%        # index of the last row inside the window
  ungroup() %>%
  mutate(row = row_number()) %>%
  rowwise() %>%
  mutate(nextvalue = getCI(row, flag)) %>%             # collapse the following values into one string
  select(value, nextvalue) %>%
  separate(nextvalue, c(LETTERS), extra = "warn", fill = "right") %>%
  pivot_longer(LETTERS, names_to = c("Letter"), values_to = "nextvalue") %>%
  filter(!is.na(nextvalue)) %>%
  filter(value != nextvalue) %>%
  select(value, nextvalue) %>%
  group_by(value, nextvalue) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
# A tibble: 13 x 3
# Groups:   value [5]
   value nextvalue count
   <int> <chr>     <int>
 1     5 4             4
 2     2 4             2
 3     3 4             2
 4     4 1             2
 5     5 2             2
 6     5 3             2
 7     1 4             1
 8     2 3             1
 9     2 5             1
10     3 1             1
11     4 3             1
12     4 NA            1
13     5 1             1

But I want to see interesting ways to achieve this in much less code and in a much simpler way. I would also be interested in seeing how multicore approaches can be applied to this problem to speed up the entire computation. Please comment.

Upvotes: 4

Views: 264

Answers (6)

Jon Spring

Reputation: 66775

I have updated my answer with a fuzzyjoin solution that works with irregularly spaced data.

The fuzzyjoin package is useful for a variety of "non-equi" joins. In this case, the interval_left_join function relies on interval trees from the IRanges package on Bioconductor to speed up the search for overlapping intervals. I expect this to be less memory-intensive (and perhaps faster) than doing a full join of all pairs of observations and filtering afterwards for overlaps.

dat %>%
  mutate(end = timestamp + 60*60) %>%
  fuzzyjoin::interval_left_join(., ., by = c("timestamp", "end")) %>%
  # exclude self-matches, and exclude matches for the preceding hour
  filter(value.x != value.y, timestamp.x < timestamp.y) %>%
  count(value.x, value.y, sort = T)

# A tibble: 12 x 3
   value.x value.y     n
     <int>   <int> <int>
 1       5       4     4
 2       2       4     2
 3       3       4     2
 4       4       1     2
 5       5       2     2
 6       5       3     2
 7       1       4     1
 8       2       3     1
 9       2       5     1
10       3       1     1
11       4       3     1
12       5       1     1
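
If the packages are not installed yet, fuzzyjoin is on CRAN while IRanges is distributed through Bioconductor, so the setup looks roughly like this (a one-time step):

install.packages("fuzzyjoin")
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("IRanges")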

Original solution:

Here's an approach that relies on the data being in even increments, so that row position is enough to know whether an observation is within one hour. Each potential lead (from +15 min = 1 row to +60 min = 4 rows) is paired with the original data using map and a custom function. Then we count how many times each value is paired with a different, non-NA value.

library(tidyverse)

add_lead <- function(df, period_lead = 1) {
  df %>% mutate(val_lead = lead(value, period_lead)) 
}

# 1 is the next row, 15 min later; 4 is 60 min later
map_dfr(1:4, ~add_lead(dat, .x)) %>%
  filter(!is.na(val_lead),
         value != val_lead) %>%
  count(value, val_lead, sort = T)

# A tibble: 12 x 3
   value val_lead     n
   <int>    <int> <int>
 1     5        4     4
 2     2        4     2
 3     3        4     2
 4     4        1     2
 5     5        2     2
 6     5        3     2
 7     1        4     1
 8     2        3     1
 9     2        5     1
10     3        1     1
11     4        3     1
12     5        1     1
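
The question also asks about multicore approaches. Since each lead size is independent, the map_dfr() call could in principle be swapped for furrr::future_map_dfr(); a minimal sketch, assuming the future and furrr packages are installed (the worker count is illustrative, and on a toy dataset the parallel overhead will dominate):

library(future)
library(furrr)

plan(multisession, workers = 2)       # start a small parallel backend

# same logic as above, but each lead size is computed on a worker
future_map_dfr(1:4, ~add_lead(dat, .x)) %>%
  filter(!is.na(val_lead),
         value != val_lead) %>%
  count(value, val_lead, sort = TRUE)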

Upvotes: 1

Matt Summersgill

Reputation: 4242

This problem can be solved in a computationally efficient manner using the data.table package.

library(data.table)

dat <- data.table(timestamp = seq.POSIXt(as.POSIXct("2019-01-01 00:00:00"), as.POSIXct("2019-01-01 02:00:00"), by = "15 min"),
                  value = c(5L,5L,2L,5L,4L,3L,4L,1L,4L))

dat[,time_next := timestamp + 60*60]


dat[dat,.(value = i.value,
          nextvalue = value), on = .(timestamp>timestamp,
                                     timestamp<=time_next)
    ][!value == nextvalue, .(count = .N), by = .(value,nextvalue)
      ][order(-count)]

#     value nextvalue count
#  1:     5         4     4
#  2:     5         2     2
#  3:     5         3     2
#  4:     2         4     2
#  5:     4         1     2
#  6:     3         4     2
#  7:     2         5     1
#  8:     2         3     1
#  9:     5         1     1
# 10:     4         3     1
# 11:     3         1     1
# 12:     1         4     1

Breaking down the solution:

First, we use a non-equi join to join the table to itself on rows where a given row's timestamp falls between another row's timestamp and time_next values.

dat[dat,.(value = i.value,
          nextvalue = value), on = .(timestamp>timestamp,
                                     timestamp<=time_next)
    ]

Then two additional steps are "chained" on.

Results where value is the same as nextvalue are excluded in the first clause (referred to as the i clause in the data.table documentation), grouping variables are defined in the third (by) clause, and we count the occurrences per group in the second (j) clause.

[!value == nextvalue, .(count = .N), by = .(value,nextvalue)
      ]

Finally, an additional step to sort the results in descending order of count is chained on:

[order(-count)]

Performance

Generating some synthetic data with one million rows, this solution runs in less than one second on my computer and uses approximately 657Mb of memory.

library(data.table)
RowCount <- 1e6
dat <- data.table(timestamp = seq.POSIXt(from = as.POSIXct("2019-01-01 00:00:00"),
                                         by = "15 min",
                                         length.out = RowCount),
                  value = sample.int(5L,RowCount,replace=TRUE))

dat[,time_next := timestamp + 60*60]

system.time({

  dat[dat,.(value = i.value,
            nextvalue = value), on = .(timestamp>timestamp,
                                       timestamp<=time_next)
      ][!value == nextvalue, .(count = .N), by = .(value,nextvalue)
        ][order(-count)]

})

#  user  system elapsed 
# 0.988   0.048   0.884
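
On the multicore part of the question: recent versions of data.table already multithread several of their internal operations via OpenMP, so some of this work is parallelised out of the box. The thread count can be inspected or adjusted (a small sketch, not a tuning recommendation):

library(data.table)

getDTthreads()     # threads data.table is currently allowed to use
setDTthreads(4)    # request 4 threads for data.table's internal parallelism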

Upvotes: 3

chakuRak

Reputation: 650

In this approach, we create four copies of each row, each with its timestamp shifted 15, 30, 45, or 60 minutes earlier, and then inner_join() the original dataset back in by timestamp.

library(lubridate)
library(tidyverse)

dat %>%
  replicate(4, ., simplify = FALSE) %>%
  bind_rows(.id = "offset") %>%
  mutate(timestamp = timestamp %m-% minutes(as.numeric(offset) * 15)) %>%
  inner_join(dat, by = "timestamp") %>%
  count(value = value.y, next_value = value.x, name = "count") %>%
  arrange(-count)

   value next_value count
   <int>      <int> <int>
 1     5          4     4
 2     4          4     3
 3     5          5     3
 4     2          4     2
 5     3          4     2
 6     4          1     2
 7     5          2     2
 8     5          3     2
 9     1          4     1
10     2          3     1
11     2          5     1
12     3          1     1
13     4          3     1
14     5          1     1
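
Note that this output keeps self-pairs such as (4, 4) and (5, 5), which the question asks to ignore; one way to drop them is a filter() between the join and the count(), e.g. (same pipeline as above, assuming the libraries loaded earlier):

dat %>%
  replicate(4, ., simplify = FALSE) %>%
  bind_rows(.id = "offset") %>%
  mutate(timestamp = timestamp %m-% minutes(as.numeric(offset) * 15)) %>%
  inner_join(dat, by = "timestamp") %>%
  filter(value.x != value.y) %>%        # drop rows where a value pairs with itself
  count(value = value.y, next_value = value.x, name = "count") %>%
  arrange(-count)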

Upvotes: 0

asafpr

Reputation: 357

My solution goes like this: start by creating a full join of the tibble with itself; I used a dummy variable to do so:

dat <- mutate(dat, allc=1)
dat
# A tibble: 9 x 3
  timestamp           value  allc
  <dttm>              <int> <dbl>
1 2019-01-01 00:00:00     1     1
2 2019-01-01 00:15:00     5     1
3 2019-01-01 00:30:00     1     1
4 2019-01-01 00:45:00     1     1
5 2019-01-01 01:00:00     2     1
6 2019-01-01 01:15:00     4     1
7 2019-01-01 01:30:00     2     1
8 2019-01-01 01:45:00     2     1
9 2019-01-01 02:00:00     1     1

Then join, filter according to timestamp, and summarize:

dat %>%
  full_join(dat, by="allc") %>% 
  filter(timestamp.x < timestamp.y,
         timestamp.y < timestamp.x+60*60,
         value.x!=value.y) %>%
  group_by(value.x, value.y) %>%
  summarize(count=n())
# A tibble: 9 x 3
# Groups:   value.x [4]
  value.x value.y count
    <int>   <int> <int>
1       1       2     3
2       1       4     2
3       1       5     1
4       2       1     2
5       2       4     1
6       4       1     1
7       4       2     2
8       5       1     2
9       5       2     1

Upvotes: 0

Simon.S.A.

Reputation: 6941

If all your data is in quarter-hour time steps then you can make use of lag/lead functions. This shifts the focus from how many minutes separate the two records you will compare to how many records separate them.

  • This method compacts the results early so it uses very little additional memory.
  • This method implicitly sorts your data by timestamp.
    • If your data is already sorted, run time is approximately linear in the size of dat.
    • But if your data is not sorted, this increases run time to log-linear.

Base approach:

dat2 = dat %>%
  mutate(next_value = lead(value, N, order_by = timestamp)) %>%   # N = number of 15-minute steps to look ahead
  filter(!is.na(next_value),
         value != next_value) %>%
  group_by(value, next_value) %>%
  summarise(num = n())

The above produces the requested summary, but only for pairs of records that are exactly N 15-minute intervals apart. So we need to repeat it for N = 1, 2, 3, 4. An efficient way to do this is with lapply.

dat2 <- function(N){
  dat %>%
    mutate(next_value = lead(value, N, order_by = timestamp)) %>%
    filter(!is.na(next_value),
           value != next_value) %>%
    group_by(value, next_value) %>%
    summarise(num = n()) %>%
    mutate(step_size = N)      # for checking which value of N was used
}

df_list <- lapply(1:4, dat2)

result = do.call(rbind, df_list) %>%  # combine all df into one
  group_by(value, next_value) %>%     # get a total count over all indiv df's
  summarise(num = sum(num))

data.table::rbindlist() could also be used in place of do.call(rbind, df_list).
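
Because the four calls are independent, this lapply() is also a natural place to apply the multicore idea from the question; a minimal sketch using the base parallel package (mclapply() relies on forking, so on Windows keep mc.cores = 1 or use parLapply() instead):

library(parallel)

# run each lead size on its own core, then combine as before
df_list <- mclapply(1:4, dat2, mc.cores = 4)

result <- do.call(rbind, df_list) %>%
  group_by(value, next_value) %>%
  summarise(num = sum(num))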

Upvotes: 0

Ronak Shah

Reputation: 389155

The seed gave me different values, but here is my attempt.

We get the unique values present in the next 1 hour and add them as a list column (next_value), which is then unnested to create a separate row for each value. We then remove rows where value and next_value are the same and count the number of rows for every combination of value and next_value, arranging them in decreasing order.

library(tidyverse)

dat %>%
  mutate(next_value = map(timestamp, ~unique(value[
                          timestamp > .x & timestamp <= (.x + 60 * 60)]))) %>%
  unnest(next_value) %>%
  filter(value != next_value) %>%
  count(value, next_value, name = 'count') %>%
  arrange(desc(count))


# A tibble: 12 x 3
#   value next_value count
#   <dbl>      <int> <int>
# 1     5          4     3
# 2     4          1     2
# 3     5          2     2
# 4     5          3     2
# 5     1          4     1
# 6     2          3     1
# 7     2          4     1
# 8     2          5     1
# 9     3          1     1
#10     3          4     1
#11     4          3     1
#12     5          1     1

Another variation that differs from the above in one step: instead of a list column, we create a comma-separated string of the unique values, split it into separate rows with separate_rows(), and count the same as above.

dat %>%
    mutate(next_value = map_chr(timestamp, ~toString(unique(value[
                timestamp > .x & timestamp <= (.x + 60 * 60)])))) %>%
    separate_rows(next_value, sep = ",", convert = TRUE) %>%
    filter(value != next_value) %>%
    count(value, next_value, name = 'count') %>%
    arrange(desc(count))

data

dat <- tibble(timestamp = seq.POSIXt(as.POSIXct("2019-01-01 00:00:00"), 
                           as.POSIXct("2019-01-01 02:00:00"), by = "15 min"))
dat$value <- c(5, 5, 2, 5, 4, 3, 4,1, 4)

Upvotes: 0
