Reputation: 1821
This is what my customer order data looks like for a single customer:
order_no customer_id product amount order_total
23 1 A 100 100
24 1 A 100 300
24 1 B 100 300
24 1 C 100 300
25 1 B 100 100
26 1 A 100 200
26 1 B 100 200
I want to calculate the average order size for each customer in a new column, so for this customer it would be 175 = (100 + 300 + 100 + 200) / 4:
order_no customer_id amount order_total avg_order_size
23 1 100 100 175
24 1 100 300 175
24 1 100 300 175
24 1 100 300 175
25 1 100 100 175
26 1 100 200 175
26 1 100 200 175
I've tried to use some version of this, but no luck:
customer_stats <- data.table(customer_stats)[, avg_order_size := mean(order_total), by=list(order_no, customer_id)]
What I really need to do is select just one row from each order_no, something like mean
of all the order_no[1]
using by=(customer_id)
perhaps? If there's a way to do this in one-step and skip creating order_total
then that's even better.
Upvotes: 1
Views: 2617
Reputation: 9805
I think the trick is to key the original table by customer and order, sum the order total by customer and order, get the average order total by customer, and then join that back to the original table.
# Your data (next time, consider putting R-formatted data in the question...):
dt <- data.table(customer_id=1,
order_no=c(23,24,24,24,25,26,26),
product=c("A","A","B","C","B","A","B"),
product_amount=100,
key=c("customer_id","order_no")) # 1: key by customer and order
dt
# customer_id order_no product product_amount
#1: 1 23 A 100
#2: 1 24 A 100
#3: 1 24 B 100
#4: 1 24 C 100
#5: 1 25 B 100
#6: 1 26 A 100
#7: 1 26 B 100
dt[ # 4: join summary back to original
dt[,list(order_total=sum(product_amount)),by=list(customer_id,order_no)] [ # 2: order total by customer and order
,avg_order_size:=mean(order_total),by=list(customer_id)] # 3: add the average of order total by customer
]
# customer_id order_no product product_amount order_total avg_order_size
#1: 1 23 A 100 100 175
#2: 1 24 A 100 300 175
#3: 1 24 B 100 300 175
#4: 1 24 C 100 300 175
#5: 1 25 B 100 100 175
#6: 1 26 A 100 200 175
#7: 1 26 B 100 200 175
Upvotes: 0
Reputation: 18437
What about this one, it seems to translate your approach and no need to compute order_total
here.
dat[, sum(amount), by = list(customer_id, order_no)][ ,avg_order := mean(V1), by = customer_id]
Upvotes: 2
Reputation: 4432
You can avoid creating order_total
by doing this:
customer_stats[ , avg_order_size := sum(amount, na.rm=TRUE) / length(unique(order_no)), by=customer_id]
However, I have reservations about how fast this will be.
Upvotes: 0