Tidying data with several repeating variables in R

Question

I have a dataframe that looks like the following. There are several variables (like "c" and "z") with measurements for health, animals, enviro, and money. In the actual dataframe, there are many other columns that do not follow this pattern and are interspersed throughout.

id  c_health  c_animals  c_enviro  c_money  z_health  z_animals  z_enviro  z_money
1   3         2          4         5        7         9          6         8
2   2         3          5         4        8         7          6         9
3   4         1          2         3        9         6          8         7

I am trying to rearrange the data to make it "tidy". I am not sure what to do when there are several variables like in my current dataset. This is the kind of result I would eventually like to end up with:

id  c  z  message
1   3  7  health
1   2  9  animals
1   4  6  enviro
1   5  8  money
2   2  8  health
2   3  7  animals
2   5  6  enviro
2   4  9  money
3   4  9  health
3   1  6  animals
3   2  8  enviro
3   3  7  money

If the dataframe just included the following columns, I could make it tidy in the following way:

id  c_health  c_animals  c_enviro  c_money
1   3         2          4         5
2   2         3          5         4
3   4         1          2         3

df <- df %>%
   gather(., key = "question", value = "response", 2:5)

Simon.S.A. · Accepted Answer

You are on the right track with using gather, but need some additional steps to split the prefix off the column names. Try the following:

library(dplyr)
library(tidyr)

df = data.frame(
  id = c(1,2,3),
  c_health = c(3,2,4),
  c_animals = c(2,3,1),
  z_health = c(7,8,9),
  z_animals = c(9,7,6),
  stringsAsFactors = FALSE
)

output = df %>%
  # gather on all columns other than id
  gather(key = "question", value = "response", -all_of("id")) %>%
  # split off prefix and rest of column name
  mutate(prefix = substr(question,1,1),
         desc = substr(question,3,nchar(question))) %>%
  # keep just the columns of interest
  select(id, prefix, desc, response) %>%
  # reshape wider
  spread(prefix, response)

Update - my comment on differing prefix lengths does not return the correct answer. Because [] indexing does not work that way inside mutate. Same idea but correct syntax as follows:

output = df %>%
  # gather on all columns other than id
  gather(key = "question", value = "response", -all_of("id")) %>%
  # split off prefix and rest of column name
  mutate(split = strsplit(question, "_")) %>%
  mutate(prefix = sapply(split, function(x){x[1]}),
         desc = sapply(split, function(x){x[2]})) %>%
  # keep just the columns of interest
  select(id, prefix, desc, response) %>%
  # reshape wider
  spread(prefix, response)

Tidying data with several repeating variables in R

Answers (2)

Related Questions