Reshaping data frame to long list of key:values

Question

I have a dataset row data in the following format:

Id Avg_bed_2017 staff_2014 staff_2015 staff_2016 staff_2017 outpatient_2013
1  460265      0.00000         NA         NA         NA         NA                      NA
2  16121       13.66667        497      508.5        515        505

and my goal is to reshape this into the following format:

Id category year  value 
1   Avg_bed 2017  460265
2   Avg_bed 2017  16121 
3   staff   2014  13.667
4   .....   ....  .....

For this I have defined a function split.col.name(col) that uses as input each name of column and splits it into the category and the year and returns them as list with two elements.

I then designed a function split.col.row (rowdline) then takes as input a row of the row data and returns a row of the latter table.

I tried then (1) to run split.col.row (rowdline) on the rows of row data (257K rows) and (2) apply the apply() function on each row. The (1) is very slow and the (2) is not binding the rows properly (producing a matrix limited to the size of the input data whilst in this case the output would have much more rows). Maybe there is a more effective way to do this reshaping?

Mark White · Accepted Answer

Here is a tidyverse solution:

# make sample data
set.seed(1839)
dat <- data.frame(
  id = 1:3,
  avgbed_2017 = rnorm(3), # note: rename this column
  staff_2014 = rnorm(3),
  staff_2015 = rnorm(3),
  outpatient_2013 = rnorm(3)
)

# gather and separate
library(dplyr)
library(tidyr)
dat %>% 
  gather("key", "value", -id) %>% # gathers everything but id
  # if you wanna get rid of NA values, say na.rm = TRUE in gather
  separate("key", c("category", "year"), sep = "_")

Note that you have to rename avg_bed_2017 to avgbed_2017 for this to work. If I were better at regular expressions, I would make sep in separate be a regular expression that matches only the second occurrence of an underscore—perhaps someone else can show me how. To get around it, I just renamed the column above so that the only underscore is the one we want to split on.

Per @Dave2e:

set.seed(1839)
data.frame(
  id = 1:3,
  avg_bed_2017 = rnorm(3),
  staff_2014 = rnorm(3),
  staff_2015 = rnorm(3),
  outpatient_2013 = rnorm(3)
) %>%
  gather("key", "value", -id) %>% 
  separate("key", c("category", "year"), sep = "_(?=[0-9]{4})")

Reshaping data frame to long list of key:values

Answers (1)

Related Questions