Yves
Yves

Reputation: 31

Match and find the earliest value in R

I have two tables, say one is Sales History, the other is Key Clients.

The Sales History table looks like this(contains 500,000 rows):

Client No.   Transaction Date   Sales Amount
       abc          1/12/2014            100
       def          2/28/2014            200
       hij          3/01/2014            300
       abc          2/18/2014            400
       abc          5/26/2014            500
       xyz          7/15/2014            600
       def          8/23/2014            700
       hij          9/19/2014            800

And the Key Clients table looks like this:

Client No.
       abc
       def
       xyz

Now what I want to do with R is that, based on the Key Clients table, I need to create a new table matching the earliest Sales Amount from the Sales History table. The desired results looks like this:

Client No.  Earliest Sales Amount
       abc                    100
       def                    200
       xyz                    600

Upvotes: 3

Views: 867

Answers (4)

JonMinton
JonMinton

Reputation: 1279

How about, using 'dplyr', something like the following? (not tested, using some easier-to-type labels)

sal_hist_first <- sales_hist %>% 
  mutate(trans_date=as.Date(trans_date, format="%m/%d/%Y")) %>%
  group_by(client_no) %>%
  summarise(earliest_sales_amount = sales_amount[trans_date == min(trans_date)])

result <- keys_clients %>% left_join(sal_hist_first)

Upvotes: 2

akrun
akrun

Reputation: 887118

If the 'Transaction Date' column is ordered, then we can use match

df2$EarliestSalesAmount <- df1[,3][match(df2[,1], df1[,1])]

Or we can use data.table to join both the datasets after setting the key as 'Client_No.'. order using 'Transaction_Date' after converting to 'Date' class, select the first row (.SD[1L]) by 'Client_No', assign the unwanted columns to NULL, and change the column names accordingly (setnames)

library(data.table)
 res <- setkey(setDT(df1), Client_No.)[df2][order(as.Date(Transaction_Date, 
         format='%m/%d/%Y')),.SD[1L] ,Client_No.][, 2:= NULL]
 setnames(res, 2, 'Earliest_Sales_Amount')
#    Client_No. Earliest_Sales_Amount
#1:        abc                   100
#2:        def                   200
#3:        xyz                   600

data

df1 <- structure(list(Client_No. = c("abc", "def", "hij", "abc", 
"abc", 
"xyz", "def", "hij"), Transaction_Date = c("1/12/2014", "2/28/2014", 
"3/01/2014", "2/18/2014", "5/26/2014", "7/15/2014", "8/23/2014", 
"9/19/2014"), Sales_Amount = c(100L, 200L, 300L, 400L, 500L, 
600L, 700L, 800L)), .Names = c("Client_No.", "Transaction_Date", 
"Sales_Amount"), class = "data.frame", row.names = c(NA, -8L))

df2 <- structure(list(Client_No. = c("abc", "def", "xyz")),
.Names = "Client_No.", class = "data.frame", row.names = c(NA, -3L))

Upvotes: 3

David Arenburg
David Arenburg

Reputation: 92292

I will also recommend using data.tables binary join, but I will utilize .EACHI for this

library(data.table)
setkey(setDT(df)[, Transaction.Date := as.IDate(Transaction.Date, "%m/%d/%Y")], Client.No.)
df[key, .(Earliest.Sales.Amont = Sales.Amount[which.min(Transaction.Date)]), by = .EACHI]
#    Client.No. Earliest.Sales.Amont
# 1:        abc                  100
# 2:        def                  200
# 3:        xyz                  600

First, we will convert your data set (df) to a data.table class using setDT, convert Transaction.Date to a Date class so R will understand what you mean while looking for the minimum date and then key it by Client.No. in order to perform a binary join (this is very efficient for a big data set like yours). Then, while performing the binary join on the key table (key) we will use by = .EACHI in order to locate the Sale.Amount within the earliest date in the data set.

Upvotes: 3

Plasher
Plasher

Reputation: 328

To give a more detailed answer it would be useful to know the used programming language/IDE and exact database system.

But assuming you use something like MySQL, MSSQL or PostgreSQL and you are able to execute SQL statements the following should give you an idea what to do.

First of all you could open the Sales History Table orderd by the Transaction date. Then you could iterate through it to find the respective Key clients. Or if your Database allows the TOP statement you could use a kind of the following SQL Statement for each Key Client:

SELECT TOP 1 Client_No, Sales_Amount FROM tbl_sales_history WHERE Client_No = '<KEY CLIENT>' ORDER BY Transaction_Date

This SQL Statement would return the first record for a key client of all the records ordered by the transaction date. Which is of course the earliest sale.

In short key to you question is to order your sales history by the transaction date.

EDIT: Just saw that the programming language used is R. So my answer is probably not that useful for this example, but it should work as a general help.

Upvotes: 2

Related Questions