James Snay

Reputation: 13

How can I reshape data from long to wide?

**Sample data added after comment**

What I have:

pmts <- data.frame(stringsAsFactors=FALSE,
           name = c("johndoe", "johndoe", "janedoe", "foo", "foo", "foo"),
           pmt_amount = c(550L, 550L, 995L, 375L, 375L, 375L),
           pmt_date = c("9/1/16", "11/1/16", "12/15/16", "1/5/17", "3/5/17", "5/5/17")
)

#>      name pmt_amount pmt_date
#> 1 johndoe        550   9/1/16
#> 2 johndoe        550  11/1/16
#> 3 janedoe        995 12/15/16
#> 4     foo        375   1/5/17
#> 5     foo        375   3/5/17
#> 6     foo        375   5/5/17

What I am looking to achieve:

read.table(header = TRUE, text = 
"name    pmt_amount  first_pmt   second_pmt  third_pmt
johndoe    550        9/1/16       11/1/16    NA
  janedoe    995        12/15/16       NA       NA
  foo       375        1/5/17       3/5/17   5/5/17"
)

#>      name pmt_amount first_pmt second_pmt third_pmt
#> 1 johndoe        550    9/1/16    11/1/16      <NA>
#> 2 janedoe        995  12/15/16       <NA>      <NA>
#> 3     foo        375    1/5/17     3/5/17    5/5/17

**End of update**

I have a large dataset of payment information for different products. Some of these products offer a pay-in-full option as well as two-pay and three-pay options. I need to create First_Payment, Second_Payment, and Third_Payment fields, with NA in the unused fields when there were only one or two payments.

I've tried a couple of options, and the best workaround I have so far is this:

pmts %>%
  group_by(Email, Name, Amount, Form.Title) %>%
  summarise(First_Payment = min(Payment.Date),
           Second_Payment = median(Payment.Date),
           Last_Payment = max(Payment.Date)) -> pmts

This is obviously not ideal, as it makes up a payment date for the 2-pay plans, and I would have to instruct the end user to ignore that field and look only at the 1st and 3rd fields.

I also tried to summarise with partial sorts like this:

n <- length(pmts$Payment.Date)
sort(pmts$Payment.Date,partial=n-1)[n-1]

However, if a person didn't have three payments, it would take the n-1 date from the entire dataset and apply it to all other fields.

Ideally, a pay-in-full would have the date in the First_Payment field and NA in the 2nd/3rd fields. A 2-pay would have the 1st and 2nd dates and NA in the 3rd field. And finally, a 3-pay would have all 3 dates.

The end users here are not super data savvy so I'm trying to make this as easy to interpret as possible. Any suggestions would be tremendously appreciated. Thank you!

Upvotes: 1

Views: 74

Answers (2)

David Arenburg

Reputation: 92300

Using data.table, this is a simple one-liner:

library(data.table) #v1.9.8+
dcast(setDT(pmts), name + pmt_amount ~ rowid(name, pmt_amount))
# Using 'pmt_date' as value column. Use 'value.var' to override
#       name pmt_amount        1       2      3
# 1:     foo        375   1/5/17  3/5/17 5/5/17
# 2: janedoe        995 12/15/16      NA     NA
# 3: johndoe        550   9/1/16 11/1/16     NA

dcast converts from long to wide, and its formula accepts expressions. rowid adds a row counter within each name and pmt_amount combination, so the numbering restarts for every person even when two people pay the same amount.
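
If the wide columns need readable labels, the same idea can be written out in a few explicit steps. This is only a sketch under a few assumptions: the counter is stored in a helper column called pmt_no (a name made up here), value.var is set explicitly to avoid the guessing message, and the labels first_pmt, second_pmt, and third_pmt are taken from the desired output in the question. It also assumes no one has more than three payments and that rows already appear in payment order within each person.

```r
library(data.table)

pmts <- data.table(
  name = c("johndoe", "johndoe", "janedoe", "foo", "foo", "foo"),
  pmt_amount = c(550L, 550L, 995L, 375L, 375L, 375L),
  pmt_date = c("9/1/16", "11/1/16", "12/15/16", "1/5/17", "3/5/17", "5/5/17")
)

# rowid() numbers rows within each distinct combination of its arguments,
# in order of appearance. pmt_no is a hypothetical helper column.
pmts[, pmt_no := rowid(name, pmt_amount)]

# One row per name/amount, one column per payment number; missing
# combinations are filled with NA.
wide <- dcast(pmts, name + pmt_amount ~ pmt_no, value.var = "pmt_date")

# Rename the numeric column headers (assumes at most three payments).
setnames(wide, c("1", "2", "3"), c("first_pmt", "second_pmt", "third_pmt"))
wide
```

If the rows are not already in chronological order, converting pmt_date to Date and sorting before calling rowid would probably be needed.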

Upvotes: 1

austensen

Reputation: 3017

You can use tidyr for this.

library(dplyr)
library(tidyr)

pmts <- tibble(
  name = c("johndoe", "johndoe", "janedoe", "foo", "foo", "foo"),
  pmt_amount = c(550L, 550L, 995L, 375L, 375L, 375L),
  pmt_date = lubridate::mdy(c("9/1/16", "11/1/16", "12/15/16", "1/5/17", "3/5/17", "5/5/17"))
)

pmts
#> # A tibble: 6 x 3
#>      name pmt_amount   pmt_date
#>     <chr>      <int>     <date>
#> 1 johndoe        550 2016-09-01
#> 2 johndoe        550 2016-11-01
#> 3 janedoe        995 2016-12-15
#> 4     foo        375 2017-01-05
#> 5     foo        375 2017-03-05
#> 6     foo        375 2017-05-05

pmts_long <- pmts %>% 
  group_by(name) %>% 
  arrange(name, pmt_date) %>% 
  mutate(pmt = row_number()) %>% 
  ungroup() %>% 
  complete(name, nesting(pmt)) %>% 
  fill(pmt_amount, .direction = "down")

pmts_long
#> # A tibble: 9 x 4
#>      name   pmt pmt_amount   pmt_date
#>     <chr> <int>      <int>     <date>
#> 1     foo     1        375 2017-01-05
#> 2     foo     2        375 2017-03-05
#> 3     foo     3        375 2017-05-05
#> 4 janedoe     1        995 2016-12-15
#> 5 janedoe     2        995         NA
#> 6 janedoe     3        995         NA
#> 7 johndoe     1        550 2016-09-01
#> 8 johndoe     2        550 2016-11-01
#> 9 johndoe     3        550         NA

pmts_wide <- pmts_long %>% 
  gather("key", "val", -name, -pmt_amount, -pmt) %>% 
  unite(pmt_number, key, pmt) %>% 
  spread(pmt_number, val)

pmts_wide
#> # A tibble: 3 x 5
#>      name pmt_amount pmt_date_1 pmt_date_2 pmt_date_3
#> *   <chr>      <int>     <date>     <date>     <date>
#> 1     foo        375 2017-01-05 2017-03-05 2017-05-05
#> 2 janedoe        995 2016-12-15         NA         NA
#> 3 johndoe        550 2016-09-01 2016-11-01         NA
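
With tidyr 1.0.0 or later, the complete/gather/unite/spread steps could likely be collapsed into a single pivot_wider() call. The following is a sketch rather than a drop-in replacement: it assumes a recent tidyr and dplyr, parses the dates with base as.Date() instead of lubridate, and groups by both name and pmt_amount so the counter restarts for each plan.

```r
library(dplyr)
library(tidyr)

pmts <- tibble(
  name = c("johndoe", "johndoe", "janedoe", "foo", "foo", "foo"),
  pmt_amount = c(550L, 550L, 995L, 375L, 375L, 375L),
  pmt_date = as.Date(
    c("9/1/16", "11/1/16", "12/15/16", "1/5/17", "3/5/17", "5/5/17"),
    format = "%m/%d/%y"
  )
)

pmts_wide <- pmts %>%
  group_by(name, pmt_amount) %>%
  arrange(pmt_date, .by_group = TRUE) %>%  # earliest payment first within each group
  mutate(pmt = row_number()) %>%           # 1, 2, 3 within each name/amount
  ungroup() %>%
  pivot_wider(
    names_from = pmt,
    values_from = pmt_date,
    names_prefix = "pmt_date_"             # yields pmt_date_1, pmt_date_2, ...
  )

pmts_wide
```

Combinations that have no payment are filled with NA by pivot_wider(), so the separate complete() and fill() steps are not needed in this version.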

Upvotes: 1
