Reputation: 93
I am using R. I have a big dataset of more than 400K lines. Here is the code to reproduce 5 lines of similar data frame:
Date = as.Date(c("2013-01-03", "2013-01-03", "2013-01-04", "2013-01-04", "2013-01-05"))
CustomerID = as.factor(c("A", "B", "A", "C", "A"))
PurchaseNS = c(13, 14, 12, 8, 10)
df = data.frame(Date, CustomerID, PurchaseNS)
> df
Date CustomerID PurchaseNS
1 2013-01-03 A 13
2 2013-01-03 B 14
3 2013-01-04 A 12
4 2013-01-04 C 8
5 2013-01-05 A 10
What I need is to add an additional column that for an each row equals to the sum of previous purchases of this customer. So in the end I am trying to get next data frame:
> df
Date CustomerID PurchaseNS previousPurchases
1 2013-01-03 A 13 0
2 2013-01-03 B 14 0
3 2013-01-04 A 12 13
4 2013-01-04 C 8 0
5 2013-01-05 A 10 25
I can achieve that with for loop, but it takes too much time and I know that it's not recommended to use loops in R.
for (i in 1:nrow(df)) {
df[i, 4] = sum(subset(df, df$CustomerID == df$CustomerID[i] & df$Date < df$Date[i])$PurchaseNS)
}
I also tried using sapply
, but the code ends up looking similar to the one above and also takes too much time.
sapply(1:nrow(df), function(i) df[i, 4] =
sum(subset(df, df$CustomerID == df$CustomerID[i] & df$Date < df$Date[i])$Purchase))
I guess, functions like by
, with
, cumsum
, apply
could be useful, but so far I was not able to apply them.
Thanks in advance for your suggestions!
Upvotes: 1
Views: 1337
Reputation: 52637
You can try:
df$prevPurch <- ave(
df$PurchaseNS, df$CustomerID,
FUN=function(x) cumsum(c(0, head(x, -1)))
)
which produces:
# Date CustomerID PurchaseNS prevPurch
# 1 2013-01-03 A 13 0
# 2 2013-01-03 B 14 0
# 3 2013-01-04 A 12 13
# 4 2013-01-04 C 8 0
# 5 2013-01-05 A 10 25
ave
breaks up a vector by the groups in another vector, and then applies a function to each group.
Upvotes: 2