Reputation: 355
I have a dataset row data
in the following format:
Id Avg_bed_2017 staff_2014 staff_2015 staff_2016 staff_2017 outpatient_2013
1 460265 0.00000 NA NA NA NA NA
2 16121 13.66667 497 508.5 515 505
and my goal is to reshape this into the following format:
Id category year value
1 Avg_bed 2017 460265
2 Avg_bed 2017 16121
3 staff 2014 13.667
4 ..... .... .....
For this I have defined a function split.col.name(col)
that uses as input each name of column and splits it into the category
and the year
and returns them as list with two elements.
I then designed a function split.col.row (rowdline)
then takes as input a row of the row data
and returns a row of the latter table.
I tried then (1) to run split.col.row (rowdline)
on the rows of row data
(257K rows) and (2) apply the apply()
function on each row. The (1) is very slow and the (2) is not binding the rows properly (producing a matrix limited to the size of the input data whilst in this case the output would have much more rows). Maybe there is a more effective way to do this reshaping?
Upvotes: 0
Views: 31
Reputation: 1496
Here is a tidyverse
solution:
# make sample data
set.seed(1839)
dat <- data.frame(
id = 1:3,
avgbed_2017 = rnorm(3), # note: rename this column
staff_2014 = rnorm(3),
staff_2015 = rnorm(3),
outpatient_2013 = rnorm(3)
)
# gather and separate
library(dplyr)
library(tidyr)
dat %>%
gather("key", "value", -id) %>% # gathers everything but id
# if you wanna get rid of NA values, say na.rm = TRUE in gather
separate("key", c("category", "year"), sep = "_")
Note that you have to rename avg_bed_2017
to avgbed_2017
for this to work. If I were better at regular expressions, I would make sep
in separate
be a regular expression that matches only the second occurrence of an underscore—perhaps someone else can show me how. To get around it, I just renamed the column above so that the only underscore is the one we want to split on.
Per @Dave2e:
set.seed(1839)
data.frame(
id = 1:3,
avg_bed_2017 = rnorm(3),
staff_2014 = rnorm(3),
staff_2015 = rnorm(3),
outpatient_2013 = rnorm(3)
) %>%
gather("key", "value", -id) %>%
separate("key", c("category", "year"), sep = "_(?=[0-9]{4})")
Upvotes: 2