Mario
Mario

Reputation: 316

Vectorization of for-loop for calculations between two datasets in R

I have a dataset A with a place, starting date and finish date. On the other hand, I have a dataset B also with a place, a date and number of cars.

library(data.table)
A <- data.table(Place = c(rep(c("Place_1","Place_2"), each = 20)),
          Start_date = as.Date("2010-01-15"),
          Finish_date = as.Date(rep(c("2011-03-01","2012-04-30","2012-01-20","2011-04-05"), each = 10)))


set.seed(1001)
B <- data.table(Date = rep(seq.Date(from = as.Date("2010-01-01"), to = as.Date("2013-01-01"), by="day"), 2),
          Place = rep(c("Place_1","Place_2"),each = 1097),
          Cars = round(runif(2194, 0, 10), 0))    

I need to calculate in the dataset A a new column (total of cars) which is the sum of cars in dataset B; this sum of cars must be for a specific place and within certain period of time.

This is easily made with a for-loop statement.

for (i in 1:nrow(A)) {
    A$Tcars[i] <- sum(B[Place == A$Place[i] & Date > A$Start_date[i] & Date < A$Finish_date[i]]$Cars)
}

But my real dataset has 30.000 rows and the loop option is inefficient and time consuming. So, I am looking for a vectorized way of doing this. I have tried the next code but it does not work:

A$Tcars<-sum(B[Place == A$Place & Date > A$Start_date & Date < A$Finish_date]$Cars)

Upvotes: 3

Views: 52

Answers (1)

Frank
Frank

Reputation: 66819

You can use a non-equi join to update the table:

library(data.table)

A[, n := B[.SD, on=.(Place, Date > Start_date, Date < Finish_date), 
  sum(Cars), by=.EACHI]$V1]

If you look at ?data.table and the other introductory materials listed when you first type library(data.table), you'll get some intuition for :=, on=, by=, etc.

Upvotes: 4

Related Questions