Tidy data with both multiple variables in column names and multiple observations per row

Question

I have an untidy dataset.

dat_untidy <- data.frame(AFDWSDF = c(10, 20),
                         AFDWSDFQUAL = c("a","b"),
                         TPFCE = c(1, 1.5),
                         TPFCEQUAL = c("c","d"))

Variable names contain 3 or 4 pieces of information that I would like to decompose and store as individual variables:

dat_tidy <- data.frame(Var_X = c(rep("AFDW", 2),rep("TP", 2)),
                   Var_Y = c(rep("SD", 2), rep("FC", 2)),
                   Var_Z = c(rep("F", 2), rep("E", 2)),
                   Value = c(10, 20, 1, 1.5),
                   Qualifier = c("a","b","c","d"))

The variable names are of the form xxxxyyz. The 'xxxx', 'yy', and 'z' should each become a variable value. 'yy' and 'z' are always 2 and 1 characters, respectively, but 'xxxx' can be any number of characters.

On top of this, as in my example, the variables are (mostly) in pairs, with 'xxxxyyz' adjacent to 'xxxxyyzQUAL$' (qualifier code for the datum value). The "...QUAL$"s also need to be gathered into a tidy variable.

From the Anscombe example in the vingette, it feels like pivot_longer() could be an elegant solution, but I need help defining the regex pattern. I suspect it might be simpler to break this into two steps, perhaps first a pivot_longer() to collect the variable names into a so-called Var_XYZ, with associated Value and Qualifier, and then a mutate() to decompose the Var_XYZ, but I can't put my finger on this either.

I'm a complete novice with regex, and most comfortable with dplyr for the manipulations.

Thank you in advance.

Edward · Accepted Answer

I've used a combination of reshape from the stats package and separate from the tidyr package.

First, store the pairs of variables into the "Qualifiers" and "Vars".

Quals <- grep("QUAL$", names(dat_untidy), value=TRUE)
Vars <- sub("QUAL", "", Quals)

Then, reshape into long form and separate Vars into the three _X, _Y, _Z columns.

library(tidyr)

dat_tidy <- reshape(dat_untidy, direction="long", varying=list(Vars, Quals),
        v.names=c("Value","Qualifier"),
        timevar="Vars", times=Vars) %>%
  separate(col="Vars", into=c("Var_X","Var_Y","Var_Z"), 
           remove = TRUE, sep = -c(3,1)) %>%  # Reverse separator
  select(-id)

rownames(dat_tidy) <- NULL

dat_tidy

  Var_X Var_Y Var_Z Value Qualifier
1  AFDW    SD     F  10.0         a
2  AFDW    SD     F  20.0         b
3    TP    FC     E   1.0         c
4    TP    FC     E   1.5         d

Tidy data with both multiple variables in column names and multiple observations per row

Answers (2)

Related Questions