Reputation: 329
I want to convert the following dataset :
transaction_id productsku
1 SK0001
1 SK0002
2 AB0001
2 AC0001
2 AC0002
3 BC0001
4 BC0002
The ideal dataset is:
transaction_id x1 x2 x3
1 SK0001 SK0002
2 AB0001 AC0001 AC0002
3 BC0001
4 BC0002
So,I use the following code to convert,but it failed.
order <- cast(order_0, transaction_id ~ productsku)
Using productsku as value column. Use the value argument to cast to override this choice
Error in `[.data.frame`(data, , variables, drop = FALSE) :
undefined columns selected
Upvotes: 1
Views: 772
Reputation: 13807
A simple and fast alternative using data.table
in 2 steps
library(data.table)
# convert mydata into a data.table
setDT(mydata)
# step 1: gather productsku values by transaction id
temp <- df[, .(product = toString(productsku)), by = list(transaction_id)]
# step 2: separate productsku values in different columns
temp[, c("x1", "x2", "x3") := tstrsplit(product, ",", fill="")] # you can also use fill=NA
temp
#> transaction_id product x1 x2 x3
#> 1: 1 SK0001, SK0002 SK0001 SK0002
#> 2: 2 AB0001, AC0001, AC0002 AB0001 AC0001 AC0002
#> 3: 3 BC0001 BC0001
#> 4: 4 BC0002 BC0002
another fast alternative using dcast{data.table}
that gives a slightly different output:
# Using dcast
dcast(df, transaction_id~productsku)
#> transaction_id AB0001 AC0001 AC0002 BC0001 BC0002 SK0001 SK0002
#> 1: 1 NA NA NA NA NA SK0001 SK0002
#> 2: 2 AB0001 AC0001 AC0002 NA NA NA NA
#> 3: 3 NA NA NA BC0001 NA NA NA
#> 4: 4 NA NA NA NA BC0002 NA NA
Upvotes: 0
Reputation: 32548
Try splitting based on transation_id
and then obtaining the productsku
for each group. Then you can rbind
the list while subsetting each element of the list to be able to include the element with maximum number of productsku
.
L = lapply(split(df, df$transaction_id), function(a) a$productsku)
max_length = max(lengths(L))
do.call(rbind, lapply(L, function(a) a[1:max_length]))
# [,1] [,2] [,3]
#1 "SK0001" "SK0002" NA
#2 "AB0001" "AC0001" "AC0002"
#3 "BC0001" NA NA
#4 "BC0002" NA NA
df = structure(list(transaction_id = c(1L, 1L, 2L, 2L, 2L, 3L, 4L),
productsku = c("SK0001", "SK0002", "AB0001", "AC0001", "AC0002",
"BC0001", "BC0002")), .Names = c("transaction_id", "productsku"
), class = "data.frame", row.names = c(NA, -7L))
Upvotes: 1
Reputation: 2434
Here is a way. The idea is to combine the variable in the same group and then split them into different columns using separate
:
library(tidyverse)
df %>%
group_by(transaction_id) %>%
summarise(product=paste(productsku, collapse=", ")) %>%
separate(product, c("x1", "x2", "x3"), sep=", ")
# A tibble: 4 × 4
transaction_id x1 x2 x3
* <int> <chr> <chr> <chr>
1 1 SK0001 SK0002 <NA>
2 2 AB0001 AC0001 AC0002
3 3 BC0001 <NA> <NA>
4 4 BC0002 <NA> <NA>
Warning message:
Too few values at 3 locations: 1, 3, 4
Upvotes: 0