Reputation: 502
I am trying to achieve a sliding window aggregation. I tried something using tidyr functions, but I am sure there are much better / faster ways to achieve this.
Let me explain what I want to achieve:
I have an input dataframe dat:
dat <- tibble(timestamp = seq.POSIXt(as.POSIXct("2019-01-01 00:00:00"), as.POSIXct("2019-01-01 02:00:00"), by = "15 min"))
set.seed(42)
dat$value <- sample(1:5, nrow(dat), replace = T)
dat
# A tibble: 9 x 2
timestamp value
<dttm> <int>
1 2019-01-01 00:00:00 5
2 2019-01-01 00:15:00 5
3 2019-01-01 00:30:00 2
4 2019-01-01 00:45:00 5
5 2019-01-01 01:00:00 4
6 2019-01-01 01:15:00 3
7 2019-01-01 01:30:00 4
8 2019-01-01 01:45:00 1
9 2019-01-01 02:00:00 4
For every row, I want to find the list of unique values from the value field (but ignore itself if present) that appeared in the next 60 minutes. Let's call that list nextvalue. Then expand each row to generate pairs between value and nextvalue. Then group_by value and nextvalue, summarise the counts, and sort in descending order. For example, for the first row (value 5 at 00:00:00), the values appearing in the following 60 minutes are 5, 2, 5 and 4, so the generated pairs are (5, 2) and (5, 4).
I read the docs and put together the code below.
t <- dat$timestamp
value <- dat$value
getCI <- function(start, end) {
  paste(value[(start+1):end], collapse = "|")
}
LETTERS <- LETTERS[1:(length(unique(value)) - 1)]
dat %>%
  mutate(time_next = timestamp + 60*60) %>%
  rowwise() %>%
  mutate(flag = max(which(time_next >= t))) %>%
  ungroup() %>%
  mutate(row = row_number()) %>%
  rowwise() %>%
  mutate(nextvalue = getCI(row, flag)) %>%
  select(value, nextvalue) %>%
  separate(nextvalue, c(LETTERS), extra = "warn", fill = "right") %>%
  pivot_longer(LETTERS, names_to = c("Letter"), values_to = "nextvalue") %>%
  filter(!is.na(nextvalue)) %>%
  filter(value != nextvalue) %>%
  select(value, nextvalue) %>%
  group_by(value, nextvalue) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
# A tibble: 13 x 3
# Groups: value [5]
value nextvalue count
<int> <chr> <int>
1 5 4 4
2 2 4 2
3 3 4 2
4 4 1 2
5 5 2 2
6 5 3 2
7 1 4 1
8 2 3 1
9 2 5 1
10 3 1 1
11 4 3 1
12 4 NA 1
13 5 1 1
But I want to see interesting ways to achieve this with much less code and in a much simpler way. I would also be interested in seeing how multicore approaches can be applied to this problem to speed up the entire computation. Please comment.
Upvotes: 4
Views: 264
Reputation: 66775
I have updated my answer with a fuzzyjoin solution that works with irregularly spaced data.
The fuzzyjoin package is useful for a variety of "non-equi" joins. In this case, the interval_left_join function relies on interval trees from the IRanges package on Bioconductor to speed up the search for overlapping intervals. I expect this to be less memory-intensive (and perhaps faster) than doing a full join for all pairs of observations and filtering afterwards for overlaps.
dat %>%
  mutate(end = timestamp + 60*60) %>%
  fuzzyjoin::interval_left_join(., ., by = c("timestamp", "end")) %>%
  # exclude self-matches, and exclude matches for the preceding hour
  filter(value.x != value.y, timestamp.x < timestamp.y) %>%
  count(value.x, value.y, sort = T)
# A tibble: 12 x 3
value.x value.y n
<int> <int> <int>
1 5 4 4
2 2 4 2
3 3 4 2
4 4 1 2
5 5 2 2
6 5 3 2
7 1 4 1
8 2 3 1
9 2 5 1
10 3 1 1
11 4 3 1
12 5 1 1
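Since the join is based on time intervals rather than row offsets, the same pipeline applies when observations are unevenly spaced. Here is a minimal sketch with hypothetical irregularly spaced timestamps (dat_irr is made up purely for illustration):
library(dplyr)

# hypothetical, unevenly spaced observations
dat_irr <- tibble::tibble(
  timestamp = as.POSIXct("2019-01-01 00:00:00") + 60 * c(0, 7, 22, 40, 41, 95),
  value     = c(5L, 5L, 2L, 4L, 3L, 4L)
)

dat_irr %>%
  mutate(end = timestamp + 60*60) %>%
  fuzzyjoin::interval_left_join(., ., by = c("timestamp", "end")) %>%
  filter(value.x != value.y, timestamp.x < timestamp.y) %>%
  count(value.x, value.y, sort = TRUE)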
Original solution:
Here's an approach that relies on the data being in even increments, so the row position is enough to know if an observation is within one hour. Each potential lead (from +15 min = 1 row to +60 min = 4 rows) from 1:4 is paired to the original data using map and a custom function. Then we count how many times each value is paired with a different, non-NA number.
library(tidyverse)
add_lead <- function(df, period_lead = 1) {
  df %>% mutate(val_lead = lead(value, period_lead))
}
# 1 is the next row, 15 min later; 4 is 60 min later
map_dfr(1:4, ~add_lead(dat, .x)) %>%
  filter(!is.na(val_lead),
         value != val_lead) %>%
  count(value, val_lead, sort = T)
# A tibble: 12 x 3
value val_lead n
<int> <int> <int>
1 5 4 4
2 2 4 2
3 3 4 2
4 4 1 2
5 5 2 2
6 5 3 2
7 1 4 1
8 2 3 1
9 2 5 1
10 3 1 1
11 4 3 1
12 5 1 1
Upvotes: 1
Reputation: 4242
This problem can be solved in a computationally efficient manner using the data.table package.
library(data.table)
dat <- data.table(timestamp = seq.POSIXt(as.POSIXct("2019-01-01 00:00:00"),
                                         as.POSIXct("2019-01-01 02:00:00"), by = "15 min"),
                  value = c(5L, 5L, 2L, 5L, 4L, 3L, 4L, 1L, 4L))
dat[,time_next := timestamp + 60*60]
dat[dat, .(value = i.value,
           nextvalue = value), on = .(timestamp > timestamp,
                                      timestamp <= time_next)
   ][!value == nextvalue, .(count = .N), by = .(value, nextvalue)
   ][order(-count)]
# value nextvalue count
# 1: 5 4 4
# 2: 5 2 2
# 3: 5 3 2
# 4: 2 4 2
# 5: 4 1 2
# 6: 3 4 2
# 7: 2 5 1
# 8: 2 3 1
# 9: 5 1 1
# 10: 4 3 1
# 11: 3 1 1
# 12: 1 4 1
First, we use a non-equi join to join the table to itself on rows where a given row's timestamp falls between another row's timestamp and time_next values.
dat[dat, .(value = i.value,
           nextvalue = value), on = .(timestamp > timestamp,
                                      timestamp <= time_next)
   ]
Then two additional steps are "chained" on.
Rows where value is the same as nextvalue are excluded in the first clause (referred to as the i clause in the data.table documentation), the grouping variables are defined in the third (by) clause, and the count of occurrences per group is computed in the second (j) clause.
[!value == nextvalue, .(count = .N), by = .(value,nextvalue)
]
Finally, an additional step to sort the results in descending order of count is chained on:
[order(-count)]
With synthetic data of one million rows, this solution runs in less than one second on my computer and uses approximately 657 MB of memory.
library(data.table)
RowCount <- 1e6
dat <- data.table(timestamp = seq.POSIXt(from = as.POSIXct("2019-01-01 00:00:00"),
                                         by = "15 min",
                                         length.out = RowCount),
                  value = sample.int(5L, RowCount, replace = TRUE))
dat[,time_next := timestamp + 60*60]
system.time({
  dat[dat, .(value = i.value,
             nextvalue = value), on = .(timestamp > timestamp,
                                        timestamp <= time_next)
     ][!value == nextvalue, .(count = .N), by = .(value, nextvalue)
     ][order(-count)]
})
# user system elapsed
# 0.988 0.048 0.884
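Regarding the multicore part of the question: data.table already parallelizes a number of its internal operations (ordering, grouping, etc.) via OpenMP, and the number of threads it uses can be inspected and adjusted with getDTthreads()/setDTthreads(). How much the non-equi join itself benefits depends on your data.table version and data, so treat the following as a sketch to benchmark rather than a guaranteed speed-up:
library(data.table)

getDTthreads()   # threads data.table is currently allowed to use
setDTthreads(4)  # e.g. allow 4 threads, then re-run the timing above to compare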
Upvotes: 3
Reputation: 650
In this approach, we create four copies of each row, each shifted to a different "earlier" timestamp, and then inner_join() the original dataset onto them.
library(lubridate)
library(tidyverse)
dat %>%
  replicate(4, ., simplify = FALSE) %>%
  bind_rows(.id = "offset") %>%
  mutate(timestamp = timestamp %m-% minutes(as.numeric(offset) * 15)) %>%
  inner_join(dat, by = "timestamp") %>%
  count(value = value.y, next_value = value.x, name = "count") %>%
  arrange(-count)
value next_value count
<int> <int> <int>
1 5 4 4
2 4 4 3
3 5 5 3
4 2 4 2
5 3 4 2
6 4 1 2
7 5 2 2
8 5 3 2
9 1 4 1
10 2 3 1
11 2 5 1
12 3 1 1
13 4 3 1
14 5 1 1
Upvotes: 0
Reputation: 357
My solution goes like this: you start by creating a full join of the tibble with itself; I used a dummy variable to do so:
dat <- mutate(dat, allc=1)
> dat
# A tibble: 9 x 3
timestamp value allc
<dttm> <int> <dbl>
1 2019-01-01 00:00:00 1 1
2 2019-01-01 00:15:00 5 1
3 2019-01-01 00:30:00 1 1
4 2019-01-01 00:45:00 1 1
5 2019-01-01 01:00:00 2 1
6 2019-01-01 01:15:00 4 1
7 2019-01-01 01:30:00 2 1
8 2019-01-01 01:45:00 2 1
9 2019-01-01 02:00:00 1 1
And then join, filter according to timestamp, and summarize:
dat %>%
  full_join(dat, by = "allc") %>%
  filter(timestamp.x < timestamp.y,
         timestamp.y < timestamp.x + 60*60,
         value.x != value.y) %>%
  group_by(value.x, value.y) %>%
  summarize(count = n())
# A tibble: 9 x 3
# Groups: value.x [4]
value.x value.y count
<int> <int> <int>
1 1 2 3
2 1 4 2
3 1 5 1
4 2 1 2
5 2 4 1
6 4 1 1
7 4 2 2
8 5 1 2
9 5 2 1
Upvotes: 0
Reputation: 6941
If all your data is in quarter-hour time steps, then you can make use of lag/lead functions. This moves the focus from how many minutes are between the two records you will compare to how many records of dat are between them.
Base approach:
dat2 = dat %>%
  mutate(next_value = lead(value, N, order_by = timestamp)) %>%
  filter(!is.na(next_value),
         value != next_value) %>%
  group_by(value, next_value) %>%
  summarise(num = n())
The above produces the requested summary, but only for pairs of records that are exactly N 15-minute steps apart. So we need to repeat it for N = 1, 2, 3, 4. An efficient way to do this is with lapply.
dat2 <- function(N){
  dat %>%
    mutate(next_value = lead(value, N, order_by = timestamp)) %>%
    filter(!is.na(next_value),
           value != next_value) %>%
    group_by(value, next_value) %>%
    summarise(num = n()) %>%
    mutate(step_size = N) # for checking which value of N was used
}
df_list <- lapply(1:4, dat2)

result = do.call(rbind, df_list) %>% # combine all df into one
  group_by(value, next_value) %>%    # get a total count over all indiv df's
  summarise(num = sum(num))
data.table::rbindlist() could also be used in place of do.call(rbind, df_list), as sketched below.
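A minimal sketch of that variant (rbindlist() comes from the data.table package; the rest of the pipeline is unchanged):
library(data.table)

result <- rbindlist(df_list) %>% # stack the data frames in df_list, like do.call(rbind, ...)
  group_by(value, next_value) %>%
  summarise(num = sum(num))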
Upvotes: 0
Reputation: 389155
The seed gave me different values, but here is my attempt.
We get the unique value entries present in the next 1 hour and add them as a list column (next_value), which is then unnested to create separate rows for each value. We then remove rows where value and next_value are the same and count the number of rows for every combination of value and next_value, arranging them in decreasing order.
library(tidyverse)
dat %>%
  mutate(next_value = map(timestamp, ~unique(value[
    timestamp > .x & timestamp <= (.x + 60 * 60)]))) %>%
  unnest(next_value) %>%
  filter(value != next_value) %>%
  count(value, next_value, name = 'count') %>%
  arrange(desc(count))
# A tibble: 12 x 3
# value next_value count
# <dbl> <int> <int>
# 1 5 4 3
# 2 4 1 2
# 3 5 2 2
# 4 5 3 2
# 5 1 4 1
# 6 2 3 1
# 7 2 4 1
# 8 2 5 1
# 9 3 1 1
#10 3 4 1
#11 4 3 1
#12 5 1 1
Another variation, differing from the above in one step: instead of list columns, we create a comma-separated string of the unique value entries, split them into separate rows, and count the same way as above.
dat %>%
  mutate(next_value = map_chr(timestamp, ~toString(unique(value[
    timestamp > .x & timestamp <= (.x + 60 * 60)])))) %>%
  separate_rows(next_value, sep = ",", convert = TRUE) %>%
  filter(value != next_value) %>%
  count(value, next_value, name = 'count') %>%
  arrange(desc(count))
data
dat <- tibble(timestamp = seq.POSIXt(as.POSIXct("2019-01-01 00:00:00"),
                                     as.POSIXct("2019-01-01 02:00:00"), by = "15 min"))
dat$value <- c(5, 5, 2, 5, 4, 3, 4, 1, 4)
Upvotes: 0