Reputation: 11

R programming - Conditionally Apply a Function to a Large Data Frame Sorted By Groups

I have a very large data frame (~800K rows) sorted by CustomerID, Earliest transaction {"Date"}, and Sales that looks like this....

CustomerID    Date          Sales

AAA123        1-01-2015     49.00
AAA123        1-02-2015     24.00
AAA123        1-03-2015     17.00
BBB456        2-01-2015     117.00
CCC789        1-01-2012     65.20
CCC789        12-12-2012    43.00

I am trying to conditionally aggregate this data.frame by in such a way that I only want to get one row for each "Customer" that shows his/her earliest repeat transaction (i.e. their 2nd transaction from the table above if they have more than one row for that "CustomerID") unless that "Customer" only has one single transaction, in which case I would want that "Customer's" sole transaction to show up in my results as their "earliest" transaction entry. So in essence, my resulting data.frame would look like this:

CustomerID    Date          Sales
AAA123        1-01-2015     49.00
BBB456        2-01-2015     117.00
CCC789        1-01-2012     65.20

I tried using

results <- do.call(rbind,by(old_data,old_data$CustomerID,function(x) x[-1,]))

But unfortunately I can't get it to work the way that I want. It instead deletes out out those "Customers" who have just a single transaction. Does anyone know of a way to conditionally apply a function like "do.call" that is fast, efficient and is easy to apply?

Upvotes: 1

Answers (2)

mshum

Reputation: 257

Try this (where "df" is the name of your data frame):

df <- df[order(df$CustomerID, df$Date),] 
df <- df[!duplicated(df$CustomerID),]

The first line sorts your data frame by CustomerID, then sorts each customer's transactions in order of Date. The order() function's default is to sort in increasing order, so that each customer's transactions would be listed in order of earliest to latest.

The second line removes all rows with duplicate CustomerID -- leaving you with each customer's first (or only) transaction.

Upvotes: 0

akrun

Reputation: 887851

For big datasets, data.table can be used efficiently. We convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'CustomerID', we get the index of the minimum 'Date' and subset the dataset rows.

library(data.table)
library(lubridate)
setDT(df1)[, .SD[which.min(mdy(Date))] , by = CustomerID]
#    CustomerID      Date Sales
# 1:     AAA123 1-01-2015  49.0
# 2:     BBB456 2-01-2015 117.0
# 3:     CCC789 1-01-2012  65.2

Or we order by 'Date' after grouping with 'CustomerID' and then get the first element of Subset of Data.table (.SD).

setDT(df1)[order(mdy(Date)), head(.SD, 1L) , by =  CustomerID]

The above was based on the expected output showed by the OP. But, based on the description, it is the 2nd transaction that is required, in that case, we may need a condition

setDT(df1)[order(mdy(Date)), if(.N==1) .SD else .SD[2L], 
                            by = CustomerID]

Upvotes: 2

R programming - Conditionally Apply a Function to a Large Data Frame Sorted By Groups

Answers (2)

Related Questions