Julian Hui
Julian Hui

Reputation: 31

Using dcast.data.table to transform from long to wide only for subset of column values

I'm working on a market basket analysis project and have data that contains a variable that represents an ID, and another that contains only one item in a basket. There are ~50K unique items across users, but I have created a simple table below for illustration.

library(data.table)
dt = data.table(ID = c("1","1","1","1","2","2","2"),
                product = c("Soda","Beer","Chips","Apple","Soda","Water","Juice"))
dt # original data looks like this

I am then using the dcast function to transform each product into it's own column with binary values, indicating that they were part of the order.

dcast.data.table(dt, ID ~ product, function(x) 1, fill = 0) 

As I mentioned, I cannot use this method on the entire dataset due to memory limitations (since this would create 50K columns for each product). So, I am trying to find a solution where dcast only creates "product" columns based on the items only contained within ID ==1 (meaning, the columns "Juice" and Water" would be excluded). Also, I am working with a fairly large dataset of 34MM observations, so I am looking for an efficient solution where the data.table API can be leveraged and am specifically trying to avoid looping through products. I hope this question is clear. Thank you.

Upvotes: 2

Views: 492

Answers (1)

Frank
Frank

Reputation: 66819

This works:

dcast(dt, ID + product ~ factor(product, levels=product[ID==1]), fun.agg=length)

Using 'product' as value column. Use 'value.var' to override
   ID product Soda Beer Chips Apple NA
1:  1   Apple    0    0     0     1  0
2:  1    Beer    0    1     0     0  0
3:  1   Chips    0    0     1     0  0
4:  1    Soda    1    0     0     0  0
5:  2   Juice    0    0     0     0  1
6:  2    Soda    1    0     0     0  0
7:  2   Water    0    0     0     0  1

Upvotes: 2

Related Questions