Reputation: 213
I checked similar entries in SO, none answers my question exactly.
My problem is this: Let's say, User1 has 6 purchases, User2 has 2. Purchase data is something like this:
set.seed(1234)
purchase <- data.frame(id = c(rep("User1", 6), rep("User2", 2)),
purchaseid = sample(seq(1, 100, 1), 8),
purchaseDate = seq(Sys.Date(), Sys.Date() + 7, 1),
price = sample(seq(30, 200, 10), 8))
#
users <- data.frame(id = c("User1","User2"),
uname = c("name1", "name2"),
uaddress = c("add1", "add2"))
> purchase
id purchaseid purchaseDate price
1 User1 12 2019-09-27 140
2 User1 62 2019-09-28 110
3 User1 60 2019-09-29 200
4 User1 61 2019-09-30 190
5 User1 83 2019-10-01 60
6 User1 97 2019-10-02 150
7 User2 1 2019-10-03 160
8 User2 22 2019-10-04 120
End data required includes 1 row for each user, that keeps the user name, address, etc. Then comes next columns for 20 purchases. The purchase data needs to be placed one after another in the same row. This is the rule: only one row for each user. If the user does not have 20 purchases, the remaining fields should be empty.
End data should therefore look like this:
id uname uaddr p1id p1date p1price p2id p2date p2price p3id p3date p3price p4id
1 User1 name1 add1 12 2019-09-27 140 62 2019-09-28 110 60 2019-09-29 200 61
2 User2 name2 add2 1 2019-10-03 160 22 2019-10-04 120 NA <NA> NA NA
p4date p4price
1 2019-09-30 190
2 <NA> NA
enddata <- data.frame(id = c("User1", "User2"),
uname = c("name1", "name2"),
uaddr = c("add1", "add2"),
p1id = c(12,1),
p1date = c("2019-09-27","2019-10-03"),
p1price = c(140, 160),
p2id = c(62, 22),
p2date = c("2019-09-28", "2019-10-04"),
p2price = c(110, 120),
p3id = c(60, NA),
p3date = c("2019-09-29", NA),
p3price = c(200, NA),
p4id = c(61, NA),
p4date = c("2019-09-30", NA),
p4price = c(190, NA))
I used reshape to get the data for each user into the wide format. The idea was doing it in a loop for each user id. Then I used rbindlist
with the fill option TRUE, but this time I am having problem with column names. After reshape, each gets different column names. Without fixed number of columns, you cannot set names either.
Any elegant solution to this?
Upvotes: 1
Views: 72
Reputation: 25208
Another option using data.table
:
#pivot to wide format
setDT(users)
setDT(purchase)[, pno := rowid(id)]
ans <- dcast(purchase[users, on=.(id)], id + uname + uaddress ~ pno,
value.var=c("purchaseid","purchaseDate", "price"))
#reorder columns
nm <- grep("[1-9]$", names(ans), value=TRUE)
setcolorder(ans, c(setdiff(names(ans), nm), nm[order(gsub("(.*)_", "", nm))]))
ans
output:
id uname uaddress purchaseid_1 purchaseDate_1 price_1 purchaseid_2 purchaseDate_2 price_2 purchaseid_3 purchaseDate_3 price_3 purchaseid_4 purchaseDate_4 price_4 purchaseid_5 purchaseDate_5 price_5 purchaseid_6 purchaseDate_6 price_6
1: User1 name1 add1 12 2019-09-30 140 62 2019-10-01 110 60 2019-10-02 200 61 2019-10-03 190 83 2019-10-04 60 97 2019-10-05 150
2: User2 name2 add2 1 2019-10-06 160 22 2019-10-07 120 NA <NA> NA NA <NA> NA NA <NA> NA NA <NA> NA
Upvotes: 1
Reputation: 93881
There's no need to process each id separately. Instead we can operate by id within a single data frame. Below is a tidyverse
approach. You can stop the chain at any point to see the intermediate output. I've added comments to explain what the code is doing, but let me know if anything is unclear.
library(tidyverse)
dat = users %>%
# Join purchase data to user data
left_join(purchase) %>%
arrange(purchaseDate) %>%
# Create a count column to assign a sequence number to each purchase within each id.
# We'll use this later to create columns for each purchase event with a unique
# sequence number for each purchase.
group_by(id) %>%
mutate(seq=1:n()) %>%
ungroup %>%
# Reshape data frame to from "wide" to "long" format
gather(key, value, purchaseid:price) %>%
arrange(seq) %>%
# Paste together the "key" and "seq" columns (the resulting column will still be
# called "key"). This will allow us to spread the data frame to one row per id
# with each purchase event properly numbered.
unite(key, key, seq, sep="_") %>%
mutate(key = factor(key, levels=unique(key))) %>%
spread(key, value) %>%
# Convert date columns back to Date class
mutate_at(vars(matches("Date")), as.Date, origin="1970-01-01")
dat
id uname uaddress purchaseid_1 purchaseDate_1 price_1 purchaseid_2 purchaseDate_2 price_2 1 User1 name1 add1 12 2019-09-27 140 62 2019-09-28 110 2 User2 name2 add2 1 2019-10-03 160 22 2019-10-04 120 purchaseid_3 purchaseDate_3 price_3 purchaseid_4 purchaseDate_4 price_4 purchaseid_5 purchaseDate_5 1 60 2019-09-29 200 61 2019-09-30 190 83 2019-10-01 2 NA <NA> NA NA <NA> NA NA <NA> price_5 purchaseid_6 purchaseDate_6 price_6 1 60 97 2019-10-02 150 2 NA NA <NA> NA
Upvotes: 1