Dom
Dom

Reputation: 187

Parse data frame column name, and pivot data frame R

When executing the following reproducible example code,

value = c(1:5)
name = c('A','B','C','D','F')
`2015.cost` = c(100,200,250,300,450)
`2016.cost` = c(200,300,350,400,550)
`2017.cost` = c(300,400,450,500,650)
`2015.profit` = c(1000,4200,2450,1500,7650)
`2016.profit` = c(1300,4300,3450,5100,6850)
`2017.profit` = c(1300,4400,6450,5001,6500)
df <- data.frame(value,name,`2015.cost`,`2016.cost`,`2017.cost`,`2015.profit`,`2016.profit`,`2017.profit`)

I create the following data frame:

structure(list(value = 1:5, name = structure(1:5, .Label = c("A", 
"B", "C", "D", "F"), class = "factor"), X2015.cost = c(100, 200, 
250, 300, 450), X2016.cost = c(200, 300, 350, 400, 550), X2017.cost = c(300, 
400, 450, 500, 650), X2015.profit = c(1000, 4200, 2450, 1500, 
7650), X2016.profit = c(1300, 4300, 3450, 5100, 6850), X2017.profit = c(1300, 
4400, 6450, 5001, 6500)), .Names = c("value", "name", "X2015.cost", 
"X2016.cost", "X2017.cost", "X2015.profit", "X2016.profit", "X2017.profit"
), row.names = c(NA, -5L), class = "data.frame")

I essentially want to be able to parse out the years in my cost (at a delimiter such as that '.') and profit column names and then pivot the data frame from this wide format to long, so that I have years all in a column and cost, profit values as another column. My goal is to have a data frame like so:

value name  year    cost    profit
1    A   2015    100    1000
2    B   2015    200    4200
3    C   2015    250    2450
4    D   2015    300    1500
5    E   2015    450    7650
1    A   2016    200    1300
2    B   2016    300    4300
3    C   2016    350    3450
4    D   2016    400    5100
5    E   2016    550    6850
1    A   2017    300    1300
2    B   2017    400    4400
3    C   2017    450    6450
4    D   2017    500    5001
5    E   2017    650    6500

Any help would be appreciated

Upvotes: 1

Views: 732

Answers (2)

Matthew Plourde
Matthew Plourde

Reputation: 44614

Here's one way with dplyr and tidyr

library(dplyr)
library(tidyr)
df %>%
    gather('col', 'val', X2015.cost:X2017.profit) %>%
    separate(col, c('year', 'col_type'), sep='\\.') %>%
    mutate(year=extract_numeric(year)) %>%
    spread(col_type, val)

Upvotes: 4

akrun
akrun

Reputation: 886938

We can use melt from data.table which can take multiple measure columns.

library(data.table)
dM <-  melt(setDT(df1), measure=patterns('cost$', 'profit$'), 
    variable.name='year', value.name=c('cost', 'profit'))

dM[ ,year :=as.numeric(unique(gsub('\\D+', '', 
     grep('^X\\d+', names(df1), value=TRUE))))[year]]
dM
#    value name year cost profit
# 1:     1    A 2015  100   1000
# 2:     2    B 2015  200   4200
# 3:     3    C 2015  250   2450
# 4:     4    D 2015  300   1500
# 5:     5    F 2015  450   7650
# 6:     1    A 2016  200   1300
# 7:     2    B 2016  300   4300
# 8:     3    C 2016  350   3450
# 9:     4    D 2016  400   5100
#10:     5    F 2016  550   6850
#11:     1    A 2017  300   1300
#12:     2    B 2017  400   4400
#13:     3    C 2017  450   6450
#14:     4    D 2017  500   5001
#15:     5    F 2017  650   6500

Upvotes: 3

Related Questions