maop
maop

Reputation: 213

How to rbind reshaped data tables of different column sizes and with different names

I checked similar entries in SO, none answers my question exactly.

My problem is this: Let's say, User1 has 6 purchases, User2 has 2. Purchase data is something like this:

set.seed(1234)
purchase <- data.frame(id = c(rep("User1", 6), rep("User2", 2)),
                       purchaseid = sample(seq(1, 100, 1), 8),
                       purchaseDate = seq(Sys.Date(), Sys.Date() + 7, 1),
                       price = sample(seq(30, 200, 10), 8))
#
users <- data.frame(id = c("User1","User2"),
                    uname = c("name1", "name2"),
                    uaddress = c("add1", "add2"))
> purchase
     id purchaseid purchaseDate price
1 User1         12   2019-09-27   140
2 User1         62   2019-09-28   110
3 User1         60   2019-09-29   200
4 User1         61   2019-09-30   190
5 User1         83   2019-10-01    60
6 User1         97   2019-10-02   150
7 User2          1   2019-10-03   160
8 User2         22   2019-10-04   120

End data required includes 1 row for each user, that keeps the user name, address, etc. Then comes next columns for 20 purchases. The purchase data needs to be placed one after another in the same row. This is the rule: only one row for each user. If the user does not have 20 purchases, the remaining fields should be empty.

End data should therefore look like this:

  id uname uaddr p1id     p1date p1price p2id     p2date p2price p3id     p3date p3price p4id
1 User1 name1  add1   12 2019-09-27     140   62 2019-09-28     110   60 2019-09-29     200   61
2 User2 name2  add2    1 2019-10-03     160   22 2019-10-04     120   NA       <NA>      NA   NA
      p4date p4price
1 2019-09-30     190
2       <NA>      NA
enddata <- data.frame(id = c("User1", "User2"),
                      uname = c("name1", "name2"),
                      uaddr = c("add1", "add2"),
                      p1id = c(12,1),
                      p1date = c("2019-09-27","2019-10-03"),
                      p1price = c(140, 160),
                      p2id = c(62, 22),
                      p2date = c("2019-09-28", "2019-10-04"),
                      p2price = c(110, 120),
                      p3id = c(60, NA),
                      p3date = c("2019-09-29", NA),
                      p3price = c(200, NA),
                      p4id = c(61, NA),
                      p4date = c("2019-09-30", NA),
                      p4price = c(190, NA))

I used reshape to get the data for each user into the wide format. The idea was doing it in a loop for each user id. Then I used rbindlist with the fill option TRUE, but this time I am having problem with column names. After reshape, each gets different column names. Without fixed number of columns, you cannot set names either.

Any elegant solution to this?

Upvotes: 1

Views: 72

Answers (2)

chinsoon12
chinsoon12

Reputation: 25208

Another option using data.table:

#pivot to wide format
setDT(users)
setDT(purchase)[, pno := rowid(id)]
ans <- dcast(purchase[users, on=.(id)], id + uname + uaddress ~ pno, 
    value.var=c("purchaseid","purchaseDate", "price"))

#reorder columns
nm <- grep("[1-9]$", names(ans), value=TRUE)
setcolorder(ans, c(setdiff(names(ans), nm), nm[order(gsub("(.*)_", "", nm))]))
ans

output:

      id uname uaddress purchaseid_1 purchaseDate_1 price_1 purchaseid_2 purchaseDate_2 price_2 purchaseid_3 purchaseDate_3 price_3 purchaseid_4 purchaseDate_4 price_4 purchaseid_5 purchaseDate_5 price_5 purchaseid_6 purchaseDate_6 price_6
1: User1 name1     add1           12     2019-09-30     140           62     2019-10-01     110           60     2019-10-02     200           61     2019-10-03     190           83     2019-10-04      60           97     2019-10-05     150
2: User2 name2     add2            1     2019-10-06     160           22     2019-10-07     120           NA           <NA>      NA           NA           <NA>      NA           NA           <NA>      NA           NA           <NA>      NA

Upvotes: 1

eipi10
eipi10

Reputation: 93881

There's no need to process each id separately. Instead we can operate by id within a single data frame. Below is a tidyverse approach. You can stop the chain at any point to see the intermediate output. I've added comments to explain what the code is doing, but let me know if anything is unclear.

library(tidyverse)

dat = users %>% 
  # Join purchase data to user data
  left_join(purchase) %>% 
  arrange(purchaseDate) %>% 
  # Create a count column to assign a sequence number to each purchase within each id.
  # We'll use this later to create columns for each purchase event with a unique 
  # sequence number for each purchase.
  group_by(id) %>% 
  mutate(seq=1:n()) %>% 
  ungroup %>% 
  # Reshape data frame to from "wide" to "long" format
  gather(key, value, purchaseid:price) %>% 
  arrange(seq) %>% 
  # Paste together the "key" and "seq" columns (the resulting column will still be 
  # called "key"). This will allow us to spread the data frame to one row per id 
  # with each purchase event properly numbered.
  unite(key, key, seq, sep="_") %>% 
  mutate(key = factor(key, levels=unique(key))) %>% 
  spread(key, value) %>% 
  # Convert date columns back to Date class
  mutate_at(vars(matches("Date")), as.Date, origin="1970-01-01")

dat
     id uname uaddress purchaseid_1 purchaseDate_1 price_1 purchaseid_2 purchaseDate_2 price_2
1 User1 name1     add1           12     2019-09-27     140           62     2019-09-28     110
2 User2 name2     add2            1     2019-10-03     160           22     2019-10-04     120
  purchaseid_3 purchaseDate_3 price_3 purchaseid_4 purchaseDate_4 price_4 purchaseid_5 purchaseDate_5
1           60     2019-09-29     200           61     2019-09-30     190           83     2019-10-01
2           NA           <NA>      NA           NA           <NA>      NA           NA           <NA>
  price_5 purchaseid_6 purchaseDate_6 price_6
1      60           97     2019-10-02     150
2      NA           NA           <NA>      NA

Upvotes: 1

Related Questions