Bryan
Bryan

Reputation: 1821

Using R data.table to aggregate select rows

This is what my customer order data looks like for a single customer:

order_no customer_id  product amount  order_total
      23           1        A    100          100 
      24           1        A    100          300
      24           1        B    100          300
      24           1        C    100          300
      25           1        B    100          100
      26           1        A    100          200
      26           1        B    100          200

I want to calculate the average order size for each customer in a new column, so for this customer it would be 175 = (100 + 300 + 100 + 200) / 4:

order_no customer_id  amount  order_total  avg_order_size
      23           1     100          100             175
      24           1     100          300             175
      24           1     100          300             175
      24           1     100          300             175
      25           1     100          100             175
      26           1     100          200             175
      26           1     100          200             175

I've tried to use some version of this, but no luck:

customer_stats <- data.table(customer_stats)[, avg_order_size := mean(order_total), by=list(order_no, customer_id)]

What I really need to do is select just one row from each order_no, something like mean of all the order_no[1] using by=(customer_id) perhaps? If there's a way to do this in one-step and skip creating order_total then that's even better.

Upvotes: 1

Views: 2617

Answers (3)

dnlbrky
dnlbrky

Reputation: 9805

I think the trick is to key the original table by customer and order, sum the order total by customer and order, get the average order total by customer, and then join that back to the original table.

# Your data (next time, consider putting R-formatted data in the question...):
dt <- data.table(customer_id=1,
                 order_no=c(23,24,24,24,25,26,26),
                 product=c("A","A","B","C","B","A","B"),
                 product_amount=100,
                 key=c("customer_id","order_no")) # 1: key by customer and order

dt
#   customer_id order_no product product_amount
#1:           1       23       A            100
#2:           1       24       A            100
#3:           1       24       B            100
#4:           1       24       C            100
#5:           1       25       B            100
#6:           1       26       A            100
#7:           1       26       B            100

dt[ # 4: join summary back to original
  dt[,list(order_total=sum(product_amount)),by=list(customer_id,order_no)] [ # 2: order total by customer and order
    ,avg_order_size:=mean(order_total),by=list(customer_id)] # 3: add the average of order total by customer
  ]
#   customer_id order_no product product_amount order_total avg_order_size
#1:           1       23       A            100         100            175
#2:           1       24       A            100         300            175
#3:           1       24       B            100         300            175
#4:           1       24       C            100         300            175
#5:           1       25       B            100         100            175
#6:           1       26       A            100         200            175
#7:           1       26       B            100         200            175

Upvotes: 0

dickoa
dickoa

Reputation: 18437

What about this one, it seems to translate your approach and no need to compute order_total here.

dat[, sum(amount), by = list(customer_id, order_no)][ ,avg_order := mean(V1), by = customer_id]

Upvotes: 2

asb
asb

Reputation: 4432

You can avoid creating order_total by doing this:

customer_stats[ , avg_order_size := sum(amount, na.rm=TRUE) / length(unique(order_no)), by=customer_id]

However, I have reservations about how fast this will be.

Upvotes: 0

Related Questions