Reputation: 640
I have data in this format in R
customer_key item_key units
2669699 16865 1.00
2669699 16866 1.00
2669699 46963 2.00
2685256 55271 1.00
2685256 43458 1.00
2685256 54977 1.00
2685256 2533 1.00
2685256 55011 1.00
2685256 44785 2.00
but I want to get the unique head_key as column and I want my the other variables name be the unique values in item_key and their value would be the units like this
customer_key '16865' '16866' '46963' '55271' '43458' '54977' '2533'
2669699 1.00 1.00 1.00 0.00 0.00 0.00 0.00
2685256 0.00 0.00 0.00 1.00 1.00 1.00 2.00
Please help me transform my data for cluster analysis
Upvotes: 4
Views: 120
Reputation: 23574
Here is one way.
library(tidyr)
spread(mydf,item_key, units, fill = 0)
# customer_key 2533 16865 16866 43458 44785 46963 54977 55011 55271
#1 2669699 0 1 1 0 0 2 0 0 0
#2 2685256 1 0 0 1 2 0 1 1 1
Upvotes: 3
Reputation: 193517
Since the packages have been covered (+1 to you all), here are a couple of base solutions to join the party:
xtabs
:
xtabs(units ~ customer_key + item_key, df)
# item_key
# customer_key 2533 16865 16866 43458 44785 46963 54977 55011 55271
# 2669699 0 1 1 0 0 2 0 0 0
# 2685256 1 0 0 1 2 0 1 1 1
reshape
reshape(df, direction = "wide", idvar = "customer_key", timevar = "item_key")
# customer_key units.16865 units.16866 units.46963 units.55271
# 1 2669699 1 1 2 NA
# 4 2685256 NA NA NA 1
# units.43458 units.54977 units.2533 units.55011 units.44785
# 1 NA NA NA NA NA
# 4 1 1 1 1 2
Upvotes: 3
Reputation: 92292
That's just a simple dcast
task. Assuming df
is your data set
library(reshape2)
dcast(df, customer_key ~ item_key , value.var = "units", fill = 0)
# customer_key 2533 16865 16866 43458 44785 46963 54977 55011 55271
# 1 2669699 0 1 1 0 0 2 0 0 0
# 2 2685256 1 0 0 1 2 0 1 1 1
Upvotes: 3
Reputation: 3501
library(dplyr); library(tidyr)
df2 <- df %>% arrange(item_key) %>% spread(item_key, units, fill=0)
df2
# customer_key 2533 16865 16866 43458 44785 46963 54977 55011 55271
# 1 2669699 0 1 1 0 0 2 0 0 0
# 2 2685256 1 0 0 1 2 0 1 1 1
Data
df <- structure(list(customer_key = c(2669699L, 2669699L, 2669699L,
2685256L, 2685256L, 2685256L, 2685256L, 2685256L, 2685256L),
item_key = c(16865L, 16866L, 46963L, 55271L, 43458L, 54977L,
2533L, 55011L, 44785L), units = c(1, 1, 2, 1, 1, 1, 1, 1,
2)), .Names = c("customer_key", "item_key", "units"), class = "data.frame", row.names = c(NA,
-9L))
Upvotes: 2