Reputation: 75
I have a data frame of strangely formatted data, where information is stored as part of the column name.
library(tidyverse)
Ihave <- frame_data(
~ID,~group,~AAA_info2_BBB,~CCC_info3_DDD,
"first", 1, as.Date("1970-01-01"), as.Date("1970-01-02"),
"second", 2, as.Date("1971-01-01"), as.Date("1971-01-02"),
"third", 3, as.Date("1972-01-01"), as.Date("1972-01-02"),
)
# A tibble: 3 x 4
ID group AAA_info2_BBB CCC_info3_DDD
<chr> <dbl> <date> <date>
1 first 1 1970-01-01 1970-01-02
2 second 2 1971-01-01 1971-01-02
3 third 3 1972-01-01 1972-01-02
I would need to get the information back in the data frame, as shown below
Iwant <- frame_data(
~ID,~group,~source,~variable,~value,~period,
"first", 1, "AAA", "info1", as.Date("1970-01-01"), "BBB",
"second", 2, "AAA", "info1", as.Date("1971-01-01"), "BBB",
"third", 3, "AAA", "info1", as.Date("1972-01-01"), "BBB",
"first", 1, "CCC", "info2", as.Date("1970-01-02"), "DDD",
"second", 2, "CCC", "info2", as.Date("1971-01-02"), "DDD",
"third", 3, "CCC", "info2", as.Date("1972-01-02"), "DDD",
)
# A tibble: 6 x 6
ID group source variable value period
<chr> <dbl> <chr> <chr> <date> <chr>
1 first 1 AAA info1 1970-01-01 BBB
2 second 2 AAA info1 1971-01-01 BBB
3 third 3 AAA info1 1972-01-01 BBB
4 first 1 CCC info2 1970-01-02 DDD
5 second 2 CCC info2 1971-01-02 DDD
6 third 3 CCC info2 1972-01-02 DDD
I though it could work by writing a function that handles one of the "AAA_info2_BBB"-type columns at a time, and it seems to work using the following function
my_fun <- function(df, one_var) {
# Get string from called column name
one_var_char <-
enquo(one_var) %>%
{ as.character(.)[2] }
# Split string across "_" and return character vector
one_var_char_splitted <-
one_var_char %>%
{ strsplit(., "_")[[1]] }
new_one_var <- one_var_char_splitted[2]
names(df)[names(df) == one_var_char] <- new_one_var
df %>%
select(new_one_var) %>%
data.frame(source = one_var_char_splitted[1],
period = one_var_char_splitted[3] )
}
Which returns (as expected)
Ihave %>%
select(ID, group, AAA_info2_BBB) %>%
my_fun(AAA_info2_BBB)
info2 source period
1 1970-01-01 AAA BBB
2 1971-01-01 AAA BBB
3 1972-01-01 AAA BBB
But I cannot manage to "map" this function onto the Ihave
data frame to produce the desired Iwant
. I have tried several mixtures of purrr::map
, without success. Is there a flaw in my approach? Have I missed something?
Any help much appreciated!
Upvotes: 1
Views: 76
Reputation: 160892
I fashioned this before I saw @aosmith's comments, which is spot-on:
library(dplyr)
library(tidyr)
Ihave %>%
gather(source, value, -ID, -group) %>%
separate(source, into = c("source", "variable", "period"), sep = "_")
# # A tibble: 6 x 6
# ID group source variable period value
# <chr> <dbl> <chr> <chr> <chr> <date>
# 1 first 1 AAA info2 BBB 1970-01-01
# 2 second 2 AAA info2 BBB 1971-01-01
# 3 third 3 AAA info2 BBB 1972-01-01
# 4 first 1 CCC info3 DDD 1970-01-02
# 5 second 2 CCC info3 DDD 1971-01-02
# 6 third 3 CCC info3 DDD 1972-01-02
It relies on the number of _
-separated fields being constant, ordered, and known. If the format never changes, you're good. Otherwise, you'll need to write something a little more specific/custom to deal with any variation.
You do not need to explicitly invoke library(dplyr)
or tidyr
if you are already loading library(tidyverse)
. (I include them here in case (a) somebody comes along and is not explicitly loading all 25 packages, or (b) you thought you needed all of them but want to reduce load time by trimming unused packages.)
Upvotes: 3
Reputation: 28705
Same idea as gather
then separate
, but just for the sake of variety here's a data.table
method using melt
and tstrsplit
library(data.table)
setDT(Ihave)
melt(Ihave, c('ID', 'group'))[,
c('source', 'variable', 'period') := tstrsplit(variable, '_')]
# ID group variable value source period
# 1: first 1 info2 1970-01-01 AAA BBB
# 2: second 2 info2 1971-01-01 AAA BBB
# 3: third 3 info2 1972-01-01 AAA BBB
# 4: first 1 info3 1970-01-02 CCC DDD
# 5: second 2 info3 1971-01-02 CCC DDD
# 6: third 3 info3 1972-01-02 CCC DDD
Upvotes: 1