Reputation: 157
I have a list of data frames, all of which are the same dimensions (64 obs, 12 variables). I need to "flatten" these data frames in such a way that I return with 64 x 11 = 704 variables and one observation, deriving all combinations of one column that has all unique values and the column names of the data frame. Examples are provided below.
I have attempted using acast
and melt
to achieve this. However, the supporting operations both pre and post melt make this approach slow when having to lapply this approach over 100k+ data frames.
Here is an example data frame and my taken approach:
df <- data.frame(var1=c(1,2,3),name=c("these","are","names"),var3=c(4,NA,NA),var4=c(NA,NA,5),var6=c(NA,5,NA))
flattening <- function(df){
rownames(df) <- df$name
df$name <- NULL
df <- melt(as.matrix(df)) %>% group_by(name = paste0(Var1,"_",Var2)) %>% summarise(
value = first(value)
) %>% data.frame()
cnames <- df$name
df <- data.frame(values=df$value) %>% t() %>% data.frame()
names(df) <- cnames
df
}
flattening(df)
The example df looks as such:
var1 name var3 var4 var6
1 1 these 4 NA NA
2 2 are NA NA 5
3 3 names NA 5 NA
I am looking for the expected outcome:
are_var1 are_var3 are_var4 are_var6 names_var1 names_var3 names_var4 names_var6 these_var1 these_var3 these_var4 these_var6
values 2 NA NA 5 3 NA 5 NA 1 4 NA NA
RESULTS UPDATE:
I have a microbenchmark below where expr
is the user's handle:
Unit: milliseconds
expr min lq mean median uq max neval cld
old 78.370093 81.038799 90.272721 85.694885 89.304528 1114.03968 500 c
tmfmnk 11.829791 12.697675 13.844833 13.134485 13.623065 34.91430 500 b
s_t 1.476159 1.774409 2.030418 1.873876 2.003681 16.89159 500 a
Upvotes: 0
Views: 291
Reputation: 9525
You can also use reshape2::melt()
then use base R:
library(reshape2)
dats <- melt(df)
rownames(dats) <- paste0(dats$name,'-',dats$variable)
dats <- t(dats)
dats <- dats[-c(1,2),]
dats <- sapply(dats,as.numeric)
dats
these-var1 are-var1 names-var1 these-var3 are-var3 names-var3 these-var4 are-var4 names-var4 these-var6 are-var6
1 2 3 4 NA NA NA NA 5 NA 5
names-var6
NA
edit
Here as data.frame
:
dats <- as.data.frame.matrix(t(as.data.frame.numeric(dats)))
Upvotes: 1
Reputation: 887951
Using dcast
from data.table
which can take multiple value.var
columns
library(data.table)
out <- dcast(setDT(df)[, rn := 1], rn ~ name,
value.var = paste0("var", c(1, 3, 4, 6)))[, rn := NULL][]
setnames(out, sub("([^_]+)_([^_]+)", "\\2_\\1", names(out)))
out
# are_var1 names_var1 these_var1 are_var3 names_var3 these_var3 are_var4 names_var4 these_var4 are_var6 names_var6 these_var6
#1: 2 3 1 NA NA 4 NA 5 NA 5 NA NA
Upvotes: 0
Reputation: 40171
One dplyr
and tidyr
option could be:
df %>%
gather(var, val, -2) %>%
mutate(var = paste(name, var, sep = "_")) %>%
select(-name) %>%
spread(var, val)
are_var1 are_var3 are_var4 are_var6 names_var1 names_var3 names_var4 names_var6
1 2 NA NA 5 3 NA 5 NA
these_var1 these_var3 these_var4 these_var6
1 1 4 NA NA
It should be faster than you original approach, however, there are certainly faster possibilities.
Upvotes: 2