tidying multiple columns with dplyr

Question

I know data.table is able to tidy multiple columns at once, unlike dplyr, which relies on multiple gather and spread steps that can be difficult to visualize.

Here's a knotty problem for the tidyverse:

library(tidyverse)
df <- data_frame(month_1 = c("Jan", "Feb", "Mar", "Jun"),
                        score_1 = c(4, 5, 6, 4),
                        month_2 = c("Jan", "Mar", NA, NA),
                        score_2 = c(3, 2, NA, NA),
                        month_3 = c("Feb", "Mar", "Jun", NA),
                        score_3 = c(8, 7, 4, NA))

# A tibble: 4 x 6
  month_1 score_1 month_2 score_2 month_3 score_3
                   
1 Jan           4 Jan           3 Feb           8
2 Feb           5 Mar           2 Mar           7
3 Mar           6 NA           NA Jun           4
4 Jun           4 NA           NA NA           NA

My desired result would be:

id month score
1  Jan   4
1  Feb   5
1  Mar   6
1  Jun   4
2  Jan   3
2  Mar   2
3  Feb   8  
3  Mar   7
3  Jun   4

data.table users can solve this by melting with patterns, as:

melt(setDT(df), measure = patterns("^month", "^score"))

But since there's no equivalent dplyr function, I understand there will need to be several spread. It seems like my solution below should work, but the second spread goes awry:

df %>% 
  gather(key, value) %>% 
  mutate(id = parse_number(key),
         key = str_replace(key, "_[0-9]", "")) %>% 
  spread(key, value )

Before marking this as a duplicate, please give it a try. Similar questions have the unique id in an existing column. This example has ids in the headers.

andrew_reece · Accepted Answer

You can process month and score columns separately and then join them with purrr::map_dfc:

map_dfc(c("month", "score"), 
        ~ df %>%
          select_at(vars(matches(.x))) %>%
          gather(key, !!.x) %>%
          separate(key, c("col", "id"), sep="_")) %>% 
  filter(complete.cases(.)) %>%
  select(id, month, score)

# A tibble: 9 x 3
 id   month score 
  
1 1     Jan   4    
2 1     Feb   5    
3 1     Mar   6    
4 1     Jun   4    
5 2     Jan   3    
6 2     Mar   2    
7 3     Feb   8    
8 3     Mar   7    
9 3     Jun   4

Explanation:

map_dfc iterates over the string values "month" and "score", referring to the current value as .x. The dfc suffix performs a cbind on the iterated outputs.
select_at chooses only the columns that start with .x
gather swings from wide to long, and names the values column with the .x string value.
separate splits key into two columns, containing the column type (which corresponds to the .x value) and the id number.
Once the mapping and column-binding is complete, we filter out missing values and select our target columns.

tidying multiple columns with dplyr

Answers (1)

Related Questions