Rya
Rya

Reputation: 329

convert column values into a single row by id

I want to convert the following dataset :

transaction_id productsku
1              SK0001
1              SK0002
2              AB0001
2              AC0001
2              AC0002
3              BC0001
4              BC0002

The ideal dataset is:

transaction_id x1       x2      x3
1              SK0001   SK0002
2              AB0001   AC0001  AC0002
3              BC0001
4              BC0002

So,I use the following code to convert,but it failed.

order <- cast(order_0, transaction_id ~ productsku)

Using productsku as value column.  Use the value argument to cast to override this choice
Error in `[.data.frame`(data, , variables, drop = FALSE) : 
  undefined columns selected

Upvotes: 1

Views: 772

Answers (3)

rafa.pereira
rafa.pereira

Reputation: 13807

A simple and fast alternative using data.table in 2 steps

library(data.table)

# convert mydata into a data.table
  setDT(mydata)

# step 1: gather productsku values by transaction id
  temp <- df[, .(product = toString(productsku)), by = list(transaction_id)]

# step 2: separate productsku values in different columns
  temp[, c("x1", "x2", "x3") := tstrsplit(product, ",", fill="")] # you can also use fill=NA

temp
#>    transaction_id                product     x1      x2      x3
#> 1:              1         SK0001, SK0002 SK0001  SK0002        
#> 2:              2 AB0001, AC0001, AC0002 AB0001  AC0001  AC0002
#> 3:              3                 BC0001 BC0001                
#> 4:              4                 BC0002 BC0002    

another fast alternative using dcast{data.table} that gives a slightly different output:

# Using dcast
  dcast(df, transaction_id~productsku)

#>    transaction_id AB0001 AC0001 AC0002 BC0001 BC0002 SK0001 SK0002
#> 1:              1     NA     NA     NA     NA     NA SK0001 SK0002
#> 2:              2 AB0001 AC0001 AC0002     NA     NA     NA     NA
#> 3:              3     NA     NA     NA BC0001     NA     NA     NA
#> 4:              4     NA     NA     NA     NA BC0002     NA     NA

Upvotes: 0

d.b
d.b

Reputation: 32548

Try splitting based on transation_id and then obtaining the productsku for each group. Then you can rbind the list while subsetting each element of the list to be able to include the element with maximum number of productsku.

L = lapply(split(df, df$transaction_id), function(a) a$productsku)
max_length = max(lengths(L))
do.call(rbind, lapply(L, function(a) a[1:max_length]))
#  [,1]     [,2]     [,3]    
#1 "SK0001" "SK0002" NA      
#2 "AB0001" "AC0001" "AC0002"
#3 "BC0001" NA       NA      
#4 "BC0002" NA       NA


DATA

df = structure(list(transaction_id = c(1L, 1L, 2L, 2L, 2L, 3L, 4L), 
    productsku = c("SK0001", "SK0002", "AB0001", "AC0001", "AC0002", 
    "BC0001", "BC0002")), .Names = c("transaction_id", "productsku"
    ), class = "data.frame", row.names = c(NA, -7L))

Upvotes: 1

JasonWang
JasonWang

Reputation: 2434

Here is a way. The idea is to combine the variable in the same group and then split them into different columns using separate:

library(tidyverse)
df %>% 
  group_by(transaction_id) %>%
  summarise(product=paste(productsku, collapse=", ")) %>%
  separate(product, c("x1", "x2", "x3"), sep=", ")

# A tibble: 4 × 4
  transaction_id     x1     x2     x3
*          <int>  <chr>  <chr>  <chr>
1              1 SK0001 SK0002   <NA>
2              2 AB0001 AC0001 AC0002
3              3 BC0001   <NA>   <NA>
4              4 BC0002   <NA>   <NA>
Warning message:
Too few values at 3 locations: 1, 3, 4 

Upvotes: 0

Related Questions