at80

Reputation: 790

Summing the differences between each value of one vector and all values of another vector that do not exceed it

I have the code below that loops through one sequence (b), selects the values of another sequence (a) that are less than or equal to each element, and sums the differences. For large datasets, this can take a long time. Is there a way to vectorize something like this, avoiding the loop, to improve performance?

a <- seq(1, 10, by=0.25)
b <- seq(1, 10, by=1)

# pre-allocate one result per element of b
c <- vector('list', length(b))

i <- 1
for (n in b){
    # sum of (n - a) over all values of a not exceeding n
    c[[i]] <- sum(n - a[n >= a])
    i <- i + 1
}

data.frame(c)

I've tried using data.table to bin the data and take differences, but I cannot figure out how to compute, for each bin value, the differences from every value less than it.

library(data.table)

min.n <- 1
max.n <- 10 
a <- data.table(seq(min.n, max.n, by=0.5))
colnames(a) <- 'a'
b <- seq(min.n+1, max.n+1, by=1)

bins <- findInterval(a$a,b)
a[,bins:= bins+2]
a[, diff:= bins - a]

Upvotes: 3

Views: 688

Answers (4)

chinsoon12

Reputation: 25225

Here is an option using a data.table rolling join:

library(data.table)
A <- data.table(a, key="a")
B <- data.table(b, key="b")

# N = running count of a values, cs = running sum of a
A[, c("N", "cs") := .(.I, cumsum(a))]

# rolling join: for each b, pick the last row of A with a <= b
A[B, on=.(a=b), roll=Inf, N * b - cs]

sum(a[a <= n]) can be replaced with a cumulative sum (cs here), and the rolling join finds, for each b, the last a that does not exceed it. sum(b - a[a <= b]) then simplifies to N * b - cs, because summing the constant b over N elements gives N * b.

output:

[1]   0.0   2.5   9.0  19.5  34.0  52.5  75.0 101.5 132.0 166.5
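The identity behind the rolling join can be checked directly in base R, reusing the question's a and b (a quick sketch, not part of the original answer):

```r
a <- seq(1, 10, by = 0.25)
b <- seq(1, 10, by = 1)

# direct summation, as in the question's loop
direct <- sapply(b, function(n) sum(n - a[a <= n]))

# cumulative-sum shortcut: N values of a are <= each b,
# so sum(b - a[a <= b]) = N * b - cumsum(a)[N]
cs <- cumsum(a)
N  <- findInterval(b, a)
shortcut <- N * b - cs[N]

all.equal(direct, shortcut)
# [1] TRUE
```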

edit: some timings for reference

timing code:

set.seed(0L)
library(data.table)
n <- 1e5L
a <- rnorm(n)
b <- rnorm(n/10L)
A <- data.table(a, key="a")
B <- data.table(b, key="b")

mtd0 <- function() A[B, on = .(a <= b), sum(i.b - x.a), by = .EACHI]$V1

mtd1 <- function() {
    A[, c("N", "cs") := .(.I, cumsum(a))]
    A[B, on=.(a=b), roll=Inf, N * b - cs]
}

all.equal(mtd0(), mtd1())
#[1] TRUE

microbenchmark::microbenchmark(times=1L, mtd0(), mtd1())

timings:

Unit: milliseconds
   expr         min          lq        mean      median          uq         max neval
 mtd0() 2998.208000 2998.208000 2998.208000 2998.208000 2998.208000 2998.208000     1
 mtd1()    7.807637    7.807637    7.807637    7.807637    7.807637    7.807637     1

Upvotes: 3

Uwe

Reputation: 42564

With data.table, this can be achieved by aggregating in a non-equi join:

library(data.table)
data.table(a)[data.table(b), on = .(a <= b), sum(i.b - x.a), by = .EACHI]$V1
[1]   0.0   2.5   9.0  19.5  34.0  52.5  75.0 101.5 132.0 166.5

In a way, it is similar to MattB's approach but combines the Cartesian product CJ() and the subsetting in a single non-equi join, thereby avoiding the creation of data which would be filtered out subsequently.

Note that the x. prefix is required to pick the a column from the first data.table.
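A quick sketch illustrating this (my own example, not from the answer): in a non-equi join the join column a carries i's values, so only x.a refers back to the original column.

```r
library(data.table)
a <- seq(1, 10, by = 0.25)
b <- seq(1, 10, by = 1)

# inside j, `a` is the join boundary (b's value); `x.a` is the original column
res <- data.table(a)[data.table(b), on = .(a <= b),
                     .(boundary = a, orig = x.a), by = .EACHI]
head(res)
```

Every orig value is <= its boundary, which is exactly the subset the sum runs over.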


Alternatively, sum(i.b - x.a) can be rewritten as .N * b - sum(x.a), where the special symbol .N denotes the number of rows in a group.

data.table(a)[data.table(b), on = .(a <= b), .N * b - sum(x.a), by = .EACHI]$V1
[1]   0.0   2.5   9.0  19.5  34.0  52.5  75.0 101.5 132.0 166.5

Upvotes: 3

Rui Barradas

Reputation: 76611

A base R solution with findInterval, which is fast.

i <- findInterval(b, a)
sapply(seq_along(i), function(j) sum(b[j] - a[1:i[j]]))
# [1]   0.0   2.5   9.0  19.5  34.0  52.5  75.0 101.5 132.0 166.5

Upvotes: 2

MattB

Reputation: 671

Something like this?

library(data.table)
a <- seq(1, 10, by=0.25)
b <- seq(1, 10, by=1)

all.combinations <- CJ(a, b)  # Get all possible combinations
all.combinations[b>=a, sum(b-a), by=b]  # Filter for b>=a, then sum the difference for each value of b
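As a quick consistency check (my addition, using the question's vectors), the CJ() approach agrees with the original loop:

```r
library(data.table)
a <- seq(1, 10, by = 0.25)
b <- seq(1, 10, by = 1)

# the question's loop, written as an sapply
loop.result <- sapply(b, function(n) sum(n - a[a <= n]))

# Cartesian product, filter, then sum per value of b
cj.result <- CJ(a, b)[b >= a, sum(b - a), by = b]$V1

all.equal(loop.result, cj.result)
# [1] TRUE
```

The trade-off is memory: CJ() materialises all length(a) * length(b) pairs before filtering, which the non-equi join answers avoid.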

Upvotes: 1
