Reputation: 21
I have 2 tables, "transactions" with over 500M rows and "Customers" over 3M rows
data <- data.frame(Trans = c(1,2,3,4,5), Cust01 = c("A","B","C","D","F"),
Cust02 = c("S","E","","TE","F"), Cust03 = c("F","","D","","F"))
cust_type <-data.frame(Cust = c("A","B","C","D"), Type = c("1","2","3","4"))
dataresult <- data.frame(Trans = c(1,2,3,4,5),
Cust01 = c("A","B","C","D","F"),
Cust01Type = c("1","2","3","4","5"),
Cust02 = c("S","E","","TE","F"),
Cust02Type = c("","","","",""),
Cust03 = c("F","","D","","F"),
Cust03Type = c("","","4","",""))
I would like to add the customer type to the data in an efficient way. Normally with sql
I will use multiple left join, I tried that with dplyr
but takes forever. I also tried to use %in%
with logic return and then a loop just to focus on the true values.
Does someone know a better way to do this?
Upvotes: 0
Views: 269
Reputation: 83215
When you want fast performance, nothing beats the data.table
package (yet). As your transaction data are now in wide format, the first step to do is convert it to long format. This will make it easier to process.
library(data.table) #v1.9.5
trans_data <- melt(setDT(data), id.vars = "Trans",
variable.name = "Cust", # set name variable column
variable.factor = TRUE, # set as a factor variable instead of a character variable
value.name = "Cvalue")[!Cvalue==""] # set name value column & remove empty cases
When you have done that, you can join the two datatables:
# set the keys by which you are joining
setDT(trans_data, key = "Cvalue")
setDT(cust_type, key = "Cust")
# join the customer type into the transaction data
trans_data[cust_type, Ctype:=Type]
this gives:
> trans_data Trans Cust Cvalue Ctype 1: 1 Cust01 A 1 2: 2 Cust01 B 2 3: 3 Cust01 C 3 4: 4 Cust01 D 4 5: 3 Cust03 D 4 6: 2 Cust02 E NA 7: 5 Cust01 F NA 8: 5 Cust02 F NA 9: 1 Cust03 F NA 10: 5 Cust03 F NA 11: 1 Cust02 S NA 12: 4 Cust02 TE NA
If you want to change the order in the resulting data.table
, you can do that with for example:
setorder(trans_data, Trans, Cust)
or all at once with:
trans_data <- trans_data[cust_type, Ctype:=Type][order(Trans,Cust)]
which gives:
> trans_data Trans Cust Cvalue Ctype 1: 1 Cust01 A 1 2: 1 Cust02 S NA 3: 1 Cust03 F NA 4: 2 Cust01 B 2 5: 2 Cust02 E NA 6: 3 Cust01 C 3 7: 3 Cust03 D 4 8: 4 Cust01 D 4 9: 4 Cust02 TE NA 10: 5 Cust01 F NA 11: 5 Cust02 F NA 12: 5 Cust03 F NA
Upvotes: 1