xvrtzn
xvrtzn

Reputation: 75

Column manipulation in R - data stored as column name

I have a data frame of strangely formatted data, where information is stored as part of the column name.

library(tidyverse)

Ihave <- frame_data(
  ~ID,~group,~AAA_info2_BBB,~CCC_info3_DDD,
  "first",  1, as.Date("1970-01-01"), as.Date("1970-01-02"),
  "second", 2, as.Date("1971-01-01"), as.Date("1971-01-02"),
  "third",  3, as.Date("1972-01-01"), as.Date("1972-01-02"),
)

# A tibble: 3 x 4
  ID     group AAA_info2_BBB CCC_info3_DDD
  <chr>  <dbl> <date>        <date>       
1 first      1 1970-01-01    1970-01-02   
2 second     2 1971-01-01    1971-01-02   
3 third      3 1972-01-01    1972-01-02   

I would need to get the information back in the data frame, as shown below

Iwant <-  frame_data(
  ~ID,~group,~source,~variable,~value,~period,
  "first",  1, "AAA", "info1", as.Date("1970-01-01"), "BBB",
  "second", 2, "AAA", "info1", as.Date("1971-01-01"), "BBB",
  "third",  3, "AAA", "info1", as.Date("1972-01-01"), "BBB",
  "first",  1, "CCC", "info2", as.Date("1970-01-02"), "DDD",
  "second", 2, "CCC", "info2", as.Date("1971-01-02"), "DDD",
  "third",  3, "CCC", "info2", as.Date("1972-01-02"), "DDD",
)

# A tibble: 6 x 6
  ID     group source variable value      period
  <chr>  <dbl> <chr>  <chr>    <date>     <chr> 
1 first      1 AAA    info1    1970-01-01 BBB   
2 second     2 AAA    info1    1971-01-01 BBB   
3 third      3 AAA    info1    1972-01-01 BBB   
4 first      1 CCC    info2    1970-01-02 DDD   
5 second     2 CCC    info2    1971-01-02 DDD   
6 third      3 CCC    info2    1972-01-02 DDD   

I though it could work by writing a function that handles one of the "AAA_info2_BBB"-type columns at a time, and it seems to work using the following function

my_fun <- function(df, one_var) {

  # Get string from called column name
  one_var_char <- 
    enquo(one_var) %>%  
    { as.character(.)[2] }  

  # Split string across "_" and return character vector
  one_var_char_splitted <- 
    one_var_char %>% 
    { strsplit(., "_")[[1]] }

  new_one_var <- one_var_char_splitted[2]

  names(df)[names(df) == one_var_char] <- new_one_var

  df %>%
    select(new_one_var) %>% 
    data.frame(source = one_var_char_splitted[1],
               period = one_var_char_splitted[3] )
}

Which returns (as expected)

Ihave %>% 
  select(ID, group, AAA_info2_BBB) %>% 
  my_fun(AAA_info2_BBB)

       info2 source period
1 1970-01-01    AAA    BBB
2 1971-01-01    AAA    BBB
3 1972-01-01    AAA    BBB

But I cannot manage to "map" this function onto the Ihave data frame to produce the desired Iwant. I have tried several mixtures of purrr::map, without success. Is there a flaw in my approach? Have I missed something?

Any help much appreciated!

Upvotes: 1

Views: 76

Answers (2)

r2evans
r2evans

Reputation: 160892

I fashioned this before I saw @aosmith's comments, which is spot-on:

library(dplyr)
library(tidyr)
Ihave %>%
  gather(source, value, -ID, -group) %>%
  separate(source, into = c("source", "variable", "period"), sep = "_")
# # A tibble: 6 x 6
#   ID     group source variable period value     
#   <chr>  <dbl> <chr>  <chr>    <chr>  <date>    
# 1 first      1 AAA    info2    BBB    1970-01-01
# 2 second     2 AAA    info2    BBB    1971-01-01
# 3 third      3 AAA    info2    BBB    1972-01-01
# 4 first      1 CCC    info3    DDD    1970-01-02
# 5 second     2 CCC    info3    DDD    1971-01-02
# 6 third      3 CCC    info3    DDD    1972-01-02

It relies on the number of _-separated fields being constant, ordered, and known. If the format never changes, you're good. Otherwise, you'll need to write something a little more specific/custom to deal with any variation.

You do not need to explicitly invoke library(dplyr) or tidyr if you are already loading library(tidyverse). (I include them here in case (a) somebody comes along and is not explicitly loading all 25 packages, or (b) you thought you needed all of them but want to reduce load time by trimming unused packages.)

Upvotes: 3

IceCreamToucan
IceCreamToucan

Reputation: 28705

Same idea as gather then separate, but just for the sake of variety here's a data.table method using melt and tstrsplit

library(data.table)
setDT(Ihave)

melt(Ihave, c('ID', 'group'))[, 
  c('source', 'variable', 'period') := tstrsplit(variable, '_')]

#        ID group variable      value source period
# 1:  first     1    info2 1970-01-01    AAA    BBB
# 2: second     2    info2 1971-01-01    AAA    BBB
# 3:  third     3    info2 1972-01-01    AAA    BBB
# 4:  first     1    info3 1970-01-02    CCC    DDD
# 5: second     2    info3 1971-01-02    CCC    DDD
# 6:  third     3    info3 1972-01-02    CCC    DDD

Upvotes: 1

Related Questions