Todd Young
Todd Young

Reputation: 49

Data.table: creating new variables, subsetting by date periods, grouping by second variable, x, and aggregating results by final date of each x

I am completely new to data.table, so please bear with me. In the following example, I would like to create two new columns in my dataset: the total purchase price within two different time periods for each customer ID. So far, I have that working.

However, I get stuck when trying to subset the data. I would like to subset the data so that each row represents a unique customer ID along with their final purchase price, final purchase date, period 1 purchase price total, and their period 2 purchase price total.

I thought that data.table might replicate the computed totals over all rows corresponding to each customer. However, it only replicates those totals in rows corresponding to the period intervals specified in the table's i index. Since it does not replicate those totals over all rows for each customer, my final dplyr block does not do the trick.

In the second and third code blocks I will give the output of the final dplyr code and then what the output I would like to achieve.

library(lubridate)
library(data.table)
library(dplyr)

data <- data.frame(custid = c(rep(1, 4), rep(2, 4), rep(1, 4), rep(2, 4)),
                   purchase.price = seq(1, 32, by=2),
                   date = seq.Date(from=as.Date("2015-01-01"), to=as.Date("2015-01-16"), by="days"))

period_intervals <- list(period_one = interval(as.Date("2015-01-01"), as.Date("2015-01-09")),
                         period_two = interval(as.Date("2015-01-10"), as.Date("2015-01-16")))

data <- as.data.table(data)
data <- data[order(date)]
setkey(data, custid)

data <- data[date %within% period_intervals[[1]], 
             period.1.price.total := sum(purchase.price), 
             by = custid]

data <- data[date %within% period_intervals[[2]], 
             period.2.price.total := sum(purchase.price), 
             by = custid]

data_sub <- data %>%
  group_by(custid) %>%
  arrange(desc(date)) %>%
  filter(row_number() == 1)

Current result:

  custid purchase.price      date       period.1.price.total period.2.price.total
   <dbl>     <dbl>          <date>              <dbl>              <dbl>
     1         31         2015-01-16              NA                112
     2         23         2015-01-12              NA                 63

Finally, what I am aiming for:

 custid purchase.price      date       period.1.price.total period.2.price.total
   <dbl>     <dbl>          <date>              <dbl>              <dbl>
     1         31         2015-01-16              33                112
     2         23         2015-01-12              48                 63

Upvotes: 0

Views: 596

Answers (1)

akrun
akrun

Reputation: 887391

We can do this by subsetting the purchase.price with the index

data[, .(period.1.total.sum = sum(purchase.price[date %within% 
                     period_intervals[[1]]])),by = custid]

For creating the columns simultaneously, we can use Map

nm1 <- c('period.1. total.sum', 'period.2.total.sum')
data[,  (nm1) := Map(function(x,y) sum(purchase.price[x %within% y]), 
                   list(date), period_intervals), by = custid] 
data[order(custid, -date)][,.SD[1:.N==1] , custid]
#   custid purchase.price       date period.1. total.sum period.2.total.sum
#1:      1             23 2015-01-12                  33                 63
#2:      2             31 2015-01-16                  48                112

NOTE: The 'total' column creation code was not showed in the OP's post.

Upvotes: 1

Related Questions