Reputation: 790
I have the code below, which loops through a sequence b and, for each value n, selects the values of a that are less than or equal to n and sums the differences n - a. For large datasets this can take a long time. Is there a way to vectorize something like this, without looping through the sequence, to improve performance?
a <- seq(1, 10, by=0.25)
b <- seq(1, 10, by=1)
c <- vector('list', length(b))  # one result per value of b
i <- 1
for (n in b){
  c[[i]] <- sum(n - a[n >= a])  # sum of differences from every a <= n
  i <- i + 1
}
data.frame(c)
I've tried to use data.table to bin the data and find the difference, but I cannot figure out how to calculate the difference from every value less than the bin value.
library(data.table)
min.n <- 1
max.n <- 10
a <- data.table(seq(min.n, max.n, by=0.5))
colnames(a) <- 'a'
b <- seq(min.n+1, max.n+1, by=1)
bins <- findInterval(a$a, b)
a[, bins := bins + 2]
a[, diff := bins - a]
Upvotes: 3
Views: 688
Reputation: 25225
Here is an option using data.table with a rolling join:
library(data.table)
A <- data.table(a, key="a")
B <- data.table(b, key="b")
A[, c("N", "cs") := .(.I, cumsum(a))]  # row index and running sum of a
A[B, on=.(a=b), roll=Inf, N * b - cs]  # roll each b to the last a <= b
The sum over a[a <= n] can be replaced with a cumulative sum (cs here), and the rolling join finds, for each b, the last a that is less than or equal to it. Expanding the summation, sum(n - a[a <= n]) = k * n - sum(a[a <= n]), because summing the constant n over k elements gives k * n; the row index N from the join supplies k, and cs supplies sum(a[a <= n]).
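A quick sanity check of that identity for a single value (a sketch; n0 and k are illustrative names, not part of the solution):
n0 <- b[4]
k <- findInterval(n0, a)  # number of a values <= n0 (a is sorted)
all.equal(sum(n0 - a[a <= n0]), k * n0 - cumsum(a)[k])
#[1] TRUE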
output:
[1] 0.0 2.5 9.0 19.5 34.0 52.5 75.0 101.5 132.0 166.5
edit: some timings for reference
timing code:
set.seed(0L)
library(data.table)
n <- 1e5L
a <- rnorm(n)
b <- rnorm(n/10L)
A <- data.table(a, key="a")
B <- data.table(b, key="b")
mtd0 <- function() A[B, on = .(a <= b), sum(i.b - x.a), by = .EACHI]$V1
mtd1 <- function() {
  A[, c("N", "cs") := .(.I, cumsum(a))]
  A[B, on=.(a=b), roll=Inf, N * b - cs]
}
all.equal(mtd0(), mtd1())
#[1] TRUE
microbenchmark::microbenchmark(times=1L, mtd0(), mtd1())
timings:
Unit: milliseconds
expr min lq mean median uq max neval
mtd0() 2998.208000 2998.208000 2998.208000 2998.208000 2998.208000 2998.208000 1
mtd1() 7.807637 7.807637 7.807637 7.807637 7.807637 7.807637 1
Upvotes: 3
Reputation: 42564
With data.table, this can be achieved by aggregating in a non-equi join:
library(data.table)
data.table(a)[data.table(b), on = .(a <= b), sum(i.b - x.a), by = .EACHI]$V1
[1] 0.0 2.5 9.0 19.5 34.0 52.5 75.0 101.5 132.0 166.5
In a way, it is similar to MattB's approach, but it combines the Cartesian product CJ() and the subsetting in a single non-equi join, thereby avoiding the creation of data which would be filtered out subsequently.
Note that the x. prefix is required to pick the a column from the first data.table.
Alternatively, sum(i.b - x.a) can be re-written as .N * b - sum(x.a), where the special symbol .N denotes the number of rows in a group.
data.table(a)[data.table(b), on = .(a <= b), .N * b - sum(x.a), by = .EACHI]$V1
[1] 0.0 2.5 9.0 19.5 34.0 52.5 75.0 101.5 132.0 166.5
Upvotes: 3
Reputation: 76611
A base R solution with findInterval, which is fast.
i <- findInterval(b, a)  # index of the last a value <= each b
sapply(seq_along(i), function(j) sum(b[j] - a[1:i[j]]))
# [1] 0.0 2.5 9.0 19.5 34.0 52.5 75.0 101.5 132.0 166.5
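The sapply can be dropped entirely by combining findInterval with cumsum, using the same identity as the rolling-join answer (a sketch, assuming a is sorted and every b is at least min(a), so that i contains no zeros):
i <- findInterval(b, a)
i * b - cumsum(a)[i]  # k*n - sum(a[a <= n]) for each b, fully vectorized
# [1] 0.0 2.5 9.0 19.5 34.0 52.5 75.0 101.5 132.0 166.5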
Upvotes: 2
Reputation: 671
Something like this?
library(data.table)
a <- seq(1, 10, by=0.25)
b <- seq(1, 10, by=1)
all.combinations <- CJ(a, b) # Get all possible combinations
all.combinations[b >= a, sum(b - a), by=b] # Filter for b >= a, then sum the difference for each value of b
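Note that CJ() allocates the full a-by-b grid before the filter runs, so memory grows multiplicatively with the input sizes; the non-equi join above avoids materializing the pairs that get filtered out. A quick check on the inputs here:
nrow(CJ(a, b))  # length(a) * length(b) = 37 * 10 rows allocated up front
#[1] 370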
Upvotes: 1