Reputation: 35
I have an untidy dataset.
dat_untidy <- data.frame(AFDWSDF = c(10, 20),
AFDWSDFQUAL = c("a","b"),
TPFCE = c(1, 1.5),
TPFCEQUAL = c("c","d"))
Variable names contain 3 or 4 pieces of information that I would like to decompose and store as individual variables:
dat_tidy <- data.frame(Var_X = c(rep("AFDW", 2),rep("TP", 2)),
Var_Y = c(rep("SD", 2), rep("FC", 2)),
Var_Z = c(rep("F", 2), rep("E", 2)),
Value = c(10, 20, 1, 1.5),
Qualifier = c("a","b","c","d"))
The variable names are of the form xxxxyyz. The 'xxxx', 'yy', and 'z' should each become a variable value. 'yy' and 'z' are always 2 and 1 characters, respectively, but 'xxxx' can be any number of characters.
On top of this, as in my example, the variables are (mostly) in pairs, with 'xxxxyyz' adjacent to 'xxxxyyzQUAL$' (qualifier code for the datum value). The "...QUAL$"s also need to be gathered into a tidy variable.
From the Anscombe example in the vingette, it feels like pivot_longer()
could be an elegant solution, but I need help defining the regex pattern. I suspect it might be simpler to break this into two steps, perhaps first a pivot_longer()
to collect the variable names into a so-called Var_XYZ
, with associated Value
and Qualifier
, and then a mutate()
to decompose the Var_XYZ
, but I can't put my finger on this either.
I'm a complete novice with regex, and most comfortable with dplyr for the manipulations.
Thank you in advance.
Upvotes: 0
Views: 556
Reputation: 27802
Here is a data.table solution the should work...
library( data.table )
library( stringr )
#set untidy data to data.table format
setDT(dat_untidy)
#get names of columns not ending on QUAL
cols <- names( dat_untidy )[!grepl( "QUAL$", names( dat_untidy ) ) ]
#[1] "AFDWSDF" "TPFCE"
Var_X <- stringr::str_sub( cols, 1, -4 )
#[1] "AFDW" "TP"
Var_Y <- stringr::str_sub( cols, -3, -2 )
#[1] "SD" "FC"
Var_Z <- stringr::str_sub( cols, -1, -1 )
#[1] "F" "E"
#build regex patterns
Var_X_pattern <- paste0( "(", paste0("^", Var_X, collapse = "|" ), ").*" )
#[1] "(^AFDW|^TP).*"
Var_Y_pattern <- paste0( ".*(", paste0(Var_Y, collapse = "|" ),
")(", paste0(Var_Z, "$", collapse = "|" ), ")" )
#[1] ".*(SD|FC)(F$|E$)"
Var_Z_pattern <- paste0( ".*(", paste0(Var_Z, "$", collapse = "|" ), ")" )
#[1] ".*(F$|E$)"
#melt to long format (read, and then ignore, warning)
DT <- melt( dat_untidy, measure.vars = names(dat_untidy), variable.factor = FALSE )
#create columns
DT[, `:=`( Var_X = gsub( Var_X_pattern, "\\1", variable ),
Var_Y = gsub( Var_Y_pattern, "\\1", variable ),
Var_Z = gsub( Var_Z_pattern, "\\1", variable ) ) ][]
#create group_row_id
DT[, row_id := rowid( variable ) ][]
#create final output by joining, and then dropping the row_id column
DT[ !grepl("QUAL", variable ), .(Var_X, Var_Y, Var_Z, Value = value, row_id ) ][DT, Qualifier := i.value, on = .(Var_X, row_id) ][, row_id := NULL][]
output
# Var_X Var_Y Var_Z Value Qualifier
# 1: AFDW SD F 10 a
# 2: AFDW SD F 20 b
# 3: TP FC E 1 c
# 4: TP FC E 1.5 d
Upvotes: 1
Reputation: 19394
I've used a combination of reshape
from the stats package and separate
from the tidyr package.
First, store the pairs of variables into the "Qualifiers" and "Vars".
Quals <- grep("QUAL$", names(dat_untidy), value=TRUE)
Vars <- sub("QUAL", "", Quals)
Then, reshape into long form and separate Vars
into the three _X
, _Y
, _Z
columns.
library(tidyr)
dat_tidy <- reshape(dat_untidy, direction="long", varying=list(Vars, Quals),
v.names=c("Value","Qualifier"),
timevar="Vars", times=Vars) %>%
separate(col="Vars", into=c("Var_X","Var_Y","Var_Z"),
remove = TRUE, sep = -c(3,1)) %>% # Reverse separator
select(-id)
rownames(dat_tidy) <- NULL
dat_tidy
Var_X Var_Y Var_Z Value Qualifier
1 AFDW SD F 10.0 a
2 AFDW SD F 20.0 b
3 TP FC E 1.0 c
4 TP FC E 1.5 d
Upvotes: 1