aggregate time series data faster

Question

I need to calculate some features by distributing them over time and then aggregating, as shown below. The code produces the right results, but my actual data set has roughly 1 million rows, and code similar to the one below takes several days to run on my machine. I'm looking for more efficient code. I'm not sure whether the xts or tidyverse packages would be useful for a speedup here. I used data.table thinking it would help with speed; maybe it's the wrong choice. Any ideas?

library(data.table)
library(lubridate)

#toy example
rows=1000
set.seed(1)
data=data.table(
  customer.arv = as.POSIXct("2020-01-01 00:00")+dminutes(sample(1:(60*24*7),rows,replace = T)),
  location = sample(1:4,rows,replace = T),
  customer.type = sample(LETTERS[1:5],rows,replace = T),
  charge = sample(seq(50,200,10),rows,replace = T)
  )
data[,':='(customer.dep = customer.arv+dminutes(sample(1:500,rows,replace = T)),
           arv.time.floor = floor_date(customer.arv,"hours"),
           arv.hour = hour(customer.arv))]

#distribute the charge over the length of stay (departure-arrival) and calculate the hourly charge
tot.hourly.charge = function(pass.location,pass.arv.time.floor,pass.customer.type) {
  #customers present for the whole hour contribute their full charge
  full.hr.cust = data[customer.arv<=pass.arv.time.floor&customer.dep>=pass.arv.time.floor+dhours(1)&location==pass.location&customer.type==pass.customer.type,sum(charge)]
  #customers departing within the hour contribute their charge scaled by the departure minute
  partial.hr.cust = data[customer.arv<=pass.arv.time.floor&customer.dep<pass.arv.time.floor+dhours(1)&customer.dep>pass.arv.time.floor&location==pass.location&customer.type==pass.customer.type,sum(charge*minute(customer.dep)/60)]
  return(full.hr.cust+partial.hr.cust)
}

#aggregate
res = data[,.(hourly.charge = tot.hourly.charge(location,arv.time.floor,customer.type)), by=.(location,arv.time.floor,customer.type)]

#sample output
res[order(location,customer.type,arv.time.floor)][1:10,]
    location      arv.time.floor customer.type hourly.charge
 1:        1 2020-01-01 00:00:00             A       0.00000
 2:        1 2020-01-01 03:00:00             A     190.00000
 3:        1 2020-01-01 06:00:00             A     216.66667
 4:        1 2020-01-01 09:00:00             A     100.00000
 5:        1 2020-01-01 12:00:00             A     100.00000
 6:        1 2020-01-01 14:00:00             A      16.66667
 7:        1 2020-01-01 15:00:00             A      50.00000
 8:        1 2020-01-01 18:00:00             A      62.50000
 9:        1 2020-01-01 20:00:00             A       0.00000
10:        1 2020-01-01 22:00:00             A     190.00000
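
One direction that might be worth trying is a single data.table non-equi join in place of the per-group function call. The sketch below is only an illustration, not a tested replacement for the code above: it assumes the partial-hour condition is meant to select customers who depart strictly inside the hour, and the names hourly.charge.join, hr.start, hr.end, dep.frac and toy are hypothetical helpers introduced here. It pairs every (location, customer.type, hour) key with the customers already present at the start of that hour, computes each contribution in one vectorized step, and then sums by key.

```r
library(data.table)
library(lubridate)

#hypothetical helper: same aggregation as above, done with one non-equi join
hourly.charge.join = function(dt) {
  x = copy(dt)                                    #avoid modifying the input by reference
  x[, dep.frac := minute(customer.dep) / 60]      #fraction of the hour used when departing mid-hour
  keys = unique(x[, .(location, customer.type, hr.start = arv.time.floor)])
  keys[, hr.end := hr.start + 3600]               #one hour later, in seconds
  #match customers who arrived at or before the hour start and left after it
  pairs = x[keys,
            on = .(location, customer.type,
                   customer.arv <= hr.start, customer.dep > hr.start),
            .(location = i.location,
              customer.type = i.customer.type,
              arv.time.floor = i.hr.start,
              #full charge if present for the whole hour, scaled charge otherwise
              contrib = charge * fifelse(x.customer.dep >= i.hr.end, 1, dep.frac))]
  #keys without any matching customer carry NA and therefore sum to 0
  pairs[, .(hourly.charge = sum(contrib, na.rm = TRUE)),
        by = .(location, arv.time.floor, customer.type)]
}

#small hand-made data set (assumption: times in UTC, charges stored as numeric)
t0 = as.POSIXct("2020-01-01 00:00", tz = "UTC")
toy = data.table(
  customer.arv  = t0 + 60 * c(0, 70, 125, 320),
  customer.dep  = t0 + 60 * c(150, 110, 240, 360),
  location      = c(1L, 1L, 1L, 2L),
  customer.type = c("A", "A", "A", "B"),
  charge        = c(100, 60, 80, 70)
)
toy[, arv.time.floor := floor_date(customer.arv, "hours")]

res = hourly.charge.join(toy)
setorder(res, location, arv.time.floor)
print(res)
```

Whether this is actually faster on 1 million rows depends on how many customer-hour pairs the join produces, so it would need to be checked against the original function on a sample of the real data before relying on it.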

