Reputation: 11
I have a very large data frame (~800K rows) sorted by CustomerID, Earliest transaction {"Date"}, and Sales that looks like this....
CustomerID Date Sales
AAA123 1-01-2015 49.00
AAA123 1-02-2015 24.00
AAA123 1-03-2015 17.00
BBB456 2-01-2015 117.00
CCC789 1-01-2012 65.20
CCC789 12-12-2012 43.00
I am trying to conditionally aggregate this data.frame by in such a way that I only want to get one row for each "Customer" that shows his/her earliest repeat transaction (i.e. their 2nd transaction from the table above if they have more than one row for that "CustomerID") unless that "Customer" only has one single transaction, in which case I would want that "Customer's" sole transaction to show up in my results as their "earliest" transaction entry. So in essence, my resulting data.frame would look like this:
CustomerID Date Sales
AAA123 1-01-2015 49.00
BBB456 2-01-2015 117.00
CCC789 1-01-2012 65.20
I tried using
results <- do.call(rbind,by(old_data,old_data$CustomerID,function(x) x[-1,]))
But unfortunately I can't get it to work the way that I want. It instead deletes out out those "Customers" who have just a single transaction. Does anyone know of a way to conditionally apply a function like "do.call" that is fast, efficient and is easy to apply?
Upvotes: 1
Views: 109
Reputation: 257
Try this (where "df" is the name of your data frame):
df <- df[order(df$CustomerID, df$Date),]
df <- df[!duplicated(df$CustomerID),]
The first line sorts your data frame by CustomerID, then sorts each customer's transactions in order of Date. The order() function's default is to sort in increasing order, so that each customer's transactions would be listed in order of earliest to latest.
The second line removes all rows with duplicate CustomerID -- leaving you with each customer's first (or only) transaction.
Upvotes: 0
Reputation: 887851
For big datasets, data.table
can be used efficiently. We convert the 'data.frame' to 'data.table' (setDT(df1)
), grouped by 'CustomerID', we get the index of the minimum 'Date' and subset the dataset rows.
library(data.table)
library(lubridate)
setDT(df1)[, .SD[which.min(mdy(Date))] , by = CustomerID]
# CustomerID Date Sales
# 1: AAA123 1-01-2015 49.0
# 2: BBB456 2-01-2015 117.0
# 3: CCC789 1-01-2012 65.2
Or we order
by 'Date' after grouping with 'CustomerID' and then get the first element of Subset of Data.table (.SD
).
setDT(df1)[order(mdy(Date)), head(.SD, 1L) , by = CustomerID]
The above was based on the expected output showed by the OP. But, based on the description, it is the 2nd transaction that is required, in that case, we may need a condition
setDT(df1)[order(mdy(Date)), if(.N==1) .SD else .SD[2L],
by = CustomerID]
Upvotes: 2