Aggregating to remove duplicates in a data table based on the minimum in one column

Question

I am very new to the data.table package in R. The version of data.table is 1.8.2.

My data table has about 21 million rows so ideally I would love to use the data table method to solve my problem as it is the way to deal with big data these days. Below is the sample data and code:

samp_data <- data.frame(user1 = c(24, 24, 24, 56, 75, 75),
                        user2 = c(43, 43, 57, 34, 61, 61),  
                        amount1 = c(1, 4, 3, 2, 6, 8), 
                        amount2 = c(4, 7, 9, 3, 5, 6), 
                        PURCH_DATE_1 = as.Date(c("2012-01-01", "2012-04-29", 
                        "2012-03-02", "2012-06-15", "2012-03-17", "2012-09-25")), 
                        PURCH_DATE_2 = as.Date(c("2012-04-01", "2012-01-25", 
                        "2012-05-21","2012-08-18", "2012-04-03", "2012-10-29")))
samp_data$DIFF_DAYS <- abs(as.numeric(samp_data$PURCH_DATE_1-
samp_data$PURCH_DATE_2))
samp_data_new<-data.table(samp_data)

I would like to get back a data table that has the 7 original columns but where there are duplicate pairs in the user1 and user2 column, the row which has the minimum value in the DIFF_DAYS column will be kept.

Assuming that I have confused some of you with what I want, the code below contains the desired output:

samp_data_desired<-data.frame(user1=c(24,24,56,75),user2=c(43,57,34,61),
                              amount1=c(1,3,2,6),amount2=c(4,9,3,5),
                              PURCH_DATE_1=as.Date(c("2012-01-01","2012-03-02",
                              "2012-06-15","2012-03-17")),
                              PURCH_DATE_2=as.Date(c("2012-04-01","2012-05-21",
                              "2012-08-18","2012-04-03")),
                              DIFF_DAYS=c(91,80,64,17))

I know how to perform simple aggregations to find the mean, min and max for each user1 using the DT[, min(col1), by=user1] sort of idea but I was unable to successfully use the unique or duplicated functions. I tried:

samp_data_check <- data.table(samp_data, key=c("user1", "user2", 
                            "amount1", "amount2", "PURCH_DATE_1",
                            "PURCH_DATE_2"))

and

samp_data_test <- samp_data_check[, unique(DIFF_DAYS), by=c("user1", 
                  "user2", "amount1", "amount2", "PURCH_DATE_1", 
                  "PURCH_DATE_2")]

along with some variations but I'm getting horribly confused so any help will be much appreciated.

Arun · Accepted Answer

First way I could think of (doesn't have much to do with data.table except setting key column to DIFF_DAYS). Assuming your data.table is DT:

setkey(DT, "DIFF_DAYS")
DT[!duplicated(DT[, c("user1", "user2")])]

Another method (more data.table):

setkey(DT, "user1", "user2", "DIFF_DAYS")
key.DT <- unique(DT[, 1:2])
DT[key.DT, mult = "first"]

Aggregating to remove duplicates in a data table based on the minimum in one column

Answers (1)

Related Questions