fantanaman
fantanaman

Reputation: 35

Tidy data with both multiple variables in column names and multiple observations per row

I have an untidy dataset.

dat_untidy <- data.frame(AFDWSDF = c(10, 20),
                         AFDWSDFQUAL = c("a","b"),
                         TPFCE = c(1, 1.5),
                         TPFCEQUAL = c("c","d"))

Variable names contain 3 or 4 pieces of information that I would like to decompose and store as individual variables:

dat_tidy <- data.frame(Var_X = c(rep("AFDW", 2),rep("TP", 2)),
                   Var_Y = c(rep("SD", 2), rep("FC", 2)),
                   Var_Z = c(rep("F", 2), rep("E", 2)),
                   Value = c(10, 20, 1, 1.5),
                   Qualifier = c("a","b","c","d"))

The variable names are of the form xxxxyyz. The 'xxxx', 'yy', and 'z' should each become a variable value. 'yy' and 'z' are always 2 and 1 characters, respectively, but 'xxxx' can be any number of characters.

On top of this, as in my example, the variables are (mostly) in pairs, with 'xxxxyyz' adjacent to 'xxxxyyzQUAL$' (qualifier code for the datum value). The "...QUAL$"s also need to be gathered into a tidy variable.

From the Anscombe example in the vingette, it feels like pivot_longer() could be an elegant solution, but I need help defining the regex pattern. I suspect it might be simpler to break this into two steps, perhaps first a pivot_longer() to collect the variable names into a so-called Var_XYZ, with associated Value and Qualifier, and then a mutate() to decompose the Var_XYZ, but I can't put my finger on this either.

I'm a complete novice with regex, and most comfortable with dplyr for the manipulations.

Thank you in advance.

Upvotes: 0

Views: 556

Answers (2)

Wimpel
Wimpel

Reputation: 27802

Here is a data.table solution the should work...

library( data.table )
library( stringr )

#set untidy data to data.table format
setDT(dat_untidy)

#get names of columns not ending on QUAL
  cols <- names( dat_untidy )[!grepl( "QUAL$", names( dat_untidy ) ) ]
  #[1] "AFDWSDF" "TPFCE" 
  Var_X <- stringr::str_sub( cols,  1, -4 )
  #[1] "AFDW" "TP"
  Var_Y <- stringr::str_sub( cols, -3, -2 )
  #[1] "SD" "FC"
  Var_Z <- stringr::str_sub( cols, -1, -1 )  
  #[1] "F" "E"

#build regex patterns
  Var_X_pattern <- paste0( "(", paste0("^", Var_X, collapse = "|" ), ").*" )
  #[1] "(^AFDW|^TP).*"
  Var_Y_pattern <- paste0( ".*(", paste0(Var_Y, collapse = "|" ), 
                           ")(", paste0(Var_Z, "$", collapse = "|" ), ")" ) 
  #[1] ".*(SD|FC)(F$|E$)"
  Var_Z_pattern <- paste0( ".*(", paste0(Var_Z, "$", collapse = "|" ), ")" ) 
  #[1] ".*(F$|E$)"

#melt to long format (read, and then ignore, warning)
DT <- melt( dat_untidy, measure.vars = names(dat_untidy), variable.factor = FALSE )

#create columns
DT[, `:=`( Var_X = gsub( Var_X_pattern, "\\1", variable ),
           Var_Y = gsub( Var_Y_pattern, "\\1", variable ),
           Var_Z = gsub( Var_Z_pattern, "\\1", variable ) ) ][]

#create group_row_id
DT[, row_id := rowid( variable ) ][]
#create final output by joining, and then dropping the row_id column
DT[ !grepl("QUAL", variable ), .(Var_X, Var_Y, Var_Z, Value = value, row_id ) ][DT, Qualifier := i.value, on = .(Var_X, row_id) ][, row_id := NULL][]

output

#    Var_X Var_Y Var_Z Value Qualifier
# 1:  AFDW    SD     F    10         a
# 2:  AFDW    SD     F    20         b
# 3:    TP    FC     E     1         c
# 4:    TP    FC     E   1.5         d

Upvotes: 1

Edward
Edward

Reputation: 19394

I've used a combination of reshape from the stats package and separate from the tidyr package.

First, store the pairs of variables into the "Qualifiers" and "Vars".

Quals <- grep("QUAL$", names(dat_untidy), value=TRUE)
Vars <- sub("QUAL", "", Quals)

Then, reshape into long form and separate Vars into the three _X, _Y, _Z columns.

library(tidyr)

dat_tidy <- reshape(dat_untidy, direction="long", varying=list(Vars, Quals),
        v.names=c("Value","Qualifier"),
        timevar="Vars", times=Vars) %>%
  separate(col="Vars", into=c("Var_X","Var_Y","Var_Z"), 
           remove = TRUE, sep = -c(3,1)) %>%  # Reverse separator
  select(-id)

rownames(dat_tidy) <- NULL

dat_tidy

  Var_X Var_Y Var_Z Value Qualifier
1  AFDW    SD     F  10.0         a
2  AFDW    SD     F  20.0         b
3    TP    FC     E   1.0         c
4    TP    FC     E   1.5         d

Upvotes: 1

Related Questions