Reputation: 63
I have this data.frame in R:
df = data.frame("blue" = c(0,1,1,0,1),
"yellow" = c(0,0,1,0,1),
"green" = c(1,1,1,0,0),
"letter" = c("A","B","C","D","E"),
"id" = c(23,57,48,3,12))
blue yellow green Letter ID
1 0 0 1 A 23
2 1 0 1 B 57
3 1 1 1 C 48
4 0 0 0 D 3
5 1 1 0 E 12
and would like to turn it into a data frame with all possible combinations of the color columns (column names moved into rows), keeping the Letter and ID for each pair, like this:
Col_1 Col_2 C1 C2 Letter ID
1 blue yellow 0 0 A 23
2 blue green 0 1 A 23
3 yellow green 0 1 A 23
4 blue yellow 1 0 B 57
5 blue green 1 1 B 57
6 yellow green 0 1 B 57
7 blue yellow 1 1 C 48
8 blue green 1 1 C 48
9 yellow green 1 1 C 48
10 blue yellow 0 0 D 3
11 blue green 0 0 D 3
12 yellow green 0 0 D 3
13 blue yellow 1 1 E 12
14 blue green 1 0 E 12
15 yellow green 1 0 E 12
Since my database is huge, doing this with loops takes too long. Any suggestions for doing it more efficiently?
Thank you.
Upvotes: 2
Views: 915
Reputation: 107567
Consider the following base R solution, which dynamically fits any needed set of value columns:
Data
txt <- ' blue yellow green Letter ID
1 0 0 1 A 23
2 1 0 1 B 57
3 1 1 1 C 48
4 0 0 0 D 3
5 1 1 0 E 12'
df <- read.table(text = txt, header=TRUE)
Solution
# DEFINE VECTOR OF VALUES
vals <- c("blue", "yellow", "green")
# RESHAPE DATA LONG
rdf <- reshape(df, idvar = c("Letter", "ID"),
varying = vals, times = vals,
v.names = "C", timevar = "Col1", ids = NULL,
new.row.names = 1:1E4, direction = "long")
# HELPER DF FOR ALL POSSIBLE COMBNS (AVOID REVERSE DUPLICATES)
col_df <- subset(expand.grid(Col1 = vals, Col2 = vals,
stringsAsFactors = FALSE),
Col1 < Col2)
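# EQUIVALENT VARIATION: combn() on the sorted names yields the same ordered
# pairs (row order aside), without needing the Col1 < Col2 filter
col_df <- setNames(data.frame(t(combn(sort(vals), 2)), stringsAsFactors = FALSE),
                   c("Col1", "Col2"))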
# MERGE TWICE FOR EACH SET OF COLs
mdf <- merge(merge(rdf, col_df, by.x="Col1", by.y="Col1"), rdf,
by.x=c("Letter", "ID", "Col2"),
by.y=c("Letter", "ID", "Col1"),
suffixes = c(1, 2))
# RE-ORDER ROWS AND COLUMNS
mdf <- data.frame(with(mdf, mdf[order(Letter, ID),
c("Letter", "ID", "Col1", "Col2", "C1", "C2")]),
row.names = NULL)
Output
mdf
# Letter ID Col1 Col2 C1 C2
# 1 A 23 blue green 0 1
# 2 A 23 blue yellow 0 0
# 3 A 23 green yellow 1 0
# 4 B 57 blue green 1 1
# 5 B 57 green yellow 1 0
# 6 B 57 blue yellow 1 0
# 7 C 48 blue green 1 1
# 8 C 48 green yellow 1 1
# 9 C 48 blue yellow 1 1
# 10 D 3 blue green 0 0
# 11 D 3 green yellow 0 0
# 12 D 3 blue yellow 0 0
# 13 E 12 blue green 1 0
# 14 E 12 blue yellow 1 1
# 15 E 12 green yellow 0 1
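If the exact column layout shown in the question is preferred, one extra reorder of this answer's columns gives it:
mdf[c("Col1", "Col2", "C1", "C2", "Letter", "ID")]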
A likely bottleneck for very large data frames is base::reshape. Here is a faster reshaping function using matrix manipulation:
matrix_melt <- function(df1, key, indName, valName) {
  # value columns = every column that is not a key column
  value_cols <- names(df1)[!(names(df1) %in% key)]
  # single column of variable names: each value column name repeated nrow(df1) times
  mat_inds <- matrix(matrix(value_cols, nrow = nrow(df1), ncol = length(value_cols),
                            byrow = TRUE), ncol = 1)
  # single column of values: the value columns stacked on top of each other
  mat_vals <- matrix(df1[value_cols], ncol = 1)
  # the key columns are recycled to the length of the stacked values
  df2 <- setNames(data.frame(df1[key], unlist(mat_inds), unlist(mat_vals),
                             row.names = NULL, stringsAsFactors = FALSE),
                  c(key, indName, valName))
  return(df2)
}
rdf <- matrix_melt(df, c("Letter", "ID"), "Col1", "C")
rdf
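To gauge the speedup, here is a rough timing sketch. The enlarged frame big is only an illustrative test case and the timings will vary by machine:
big <- df[rep(seq_len(nrow(df)), 20000), ]   # ~100,000-row copy of the example data
big$ID <- seq_len(nrow(big))                 # keep Letter/ID combinations unique
system.time(reshape(big, idvar = c("Letter", "ID"), varying = vals,
                    times = vals, v.names = "C", timevar = "Col1",
                    direction = "long"))
system.time(matrix_melt(big, c("Letter", "ID"), "Col1", "C"))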
Upvotes: 0
Reputation: 349
Here's a data.table solution using melt. It should work well unless the database is absolutely massive, in which case you can always split it by id, but I'd guess this will be fast for your case.
library(data.table)
df = data.frame("blue" = c(0,1,1,0,1),
"yellow" = c(0,0,1,0,1),
"green" = c(1,1,1,0,0),
"letter" = c("A","B","C","D","E"),
"id" = c(23,57,48,3,12))
#convert to data.table and melt
setDT(df)
df = melt(df, id.vars = c("letter","id"))
#combine blue/yellow, blue/green, and yellow/green
df1 = merge(df[variable == "blue"],df[variable == "yellow"], by = c("letter","id"))
df2 = merge(df[variable == "blue"],df[variable == "green"], by = c("letter","id"))
df3 = merge(df[variable == "yellow"],df[variable == "green"], by = c("letter","id"))
df = rbindlist(list(df1,df2,df3))
#now fix names..
setnames(df, c("variable.x","value.x","variable.y","value.y"), c("col_1","c1","col_2","c2"))
#optionally rearrange cols...
df = df[,.(col_1,col_2,c1,c2,letter,id)]
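If there were more than three color columns, the three explicit merges could be generated programmatically. Here is a sketch of that idea; the object names dt, pairs, long and res are placeholders, and it assumes the same lower-case letter/id columns as above:
library(data.table)
# rebuild the example data as a data.table
dt <- data.table(blue   = c(0,1,1,0,1),
                 yellow = c(0,0,1,0,1),
                 green  = c(1,1,1,0,0),
                 letter = c("A","B","C","D","E"),
                 id     = c(23,57,48,3,12))
value_cols <- c("blue", "yellow", "green")
pairs <- combn(value_cols, 2, simplify = FALSE)   # blue/yellow, blue/green, yellow/green
long  <- melt(dt, id.vars = c("letter", "id"))    # one row per letter/color
# merge each color pair on letter/id and stack the results
res <- rbindlist(lapply(pairs, function(p)
  merge(long[variable == p[1]], long[variable == p[2]], by = c("letter", "id"))))
setnames(res, c("variable.x", "value.x", "variable.y", "value.y"),
         c("col_1", "c1", "col_2", "c2"))
res <- res[, .(col_1, col_2, c1, c2, letter, id)]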
Upvotes: 2