Bruna Amaral

Reputation: 63

Efficient way to create pairs of all combinations of column names as rows in a data.frame

I have this data.frame in R:

df = data.frame("blue" = c(0,1,1,0,1),
                "yellow" = c(0,0,1,0,1),
                "green" = c(1,1,1,0,0),
                "letter" = c("A","B","C","D","E"),
                "id" = c(23,57,48,3,12))
  blue yellow green Letter ID
1    0      0     1      A 23
2    1      0     1      B 57
3    1      1     1      C 48
4    0      0     0      D  3
5    1      1     0      E 12

and I would like to turn it into a data frame with all possible pairs of colors (column names turned into rows), keeping the Letter and ID for each pair, like this:

   Col_1    Col_2    C1  C2  Letter  ID
1  blue     yellow   0   0   A       23
2  blue     green    0   1   A       23
3  yellow   green    0   1   A       23
4  blue     yellow   1   0   B       57
5  blue     green    1   1   B       57
6  yellow   green    0   1   B       57
7  blue     yellow   1   1   C       48
8  blue     green    1   1   C       48
9  yellow   green    1   1   C       48
10 blue     yellow   0   0   D       3
11 blue     green    0   0   D       3
12 yellow   green    0   0   D       3
13 blue     yellow   1   1   E       12
14 blue     green    1   0   E       12
15 yellow   green    1   0   E       12

Since my database is huge, doing this with loops takes too long. Any suggestions for doing it more efficiently?
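To give an idea, a simplified loop of this kind is what becomes slow (an illustrative sketch, not the exact code):

# Illustrative only: a naive pair-by-pair loop that grows the result
# with rbind() on every iteration
cols  <- c("blue", "yellow", "green")
pairs <- combn(cols, 2)                      # 2 x 3 matrix of column pairs
out   <- data.frame()
for (i in seq_len(nrow(df))) {
  for (j in seq_len(ncol(pairs))) {
    out <- rbind(out,
                 data.frame(Col_1  = pairs[1, j],
                            Col_2  = pairs[2, j],
                            C1     = df[i, pairs[1, j]],
                            C2     = df[i, pairs[2, j]],
                            Letter = df$letter[i],
                            ID     = df$id[i]))
  }
}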

Thank you.

Upvotes: 2

Views: 915

Answers (2)

Parfait

Reputation: 107567

Consider the following base R solution, which dynamically adjusts to any needed set of value columns:

Data

txt <- '  blue yellow green Letter ID
1    0      0     1      A 23
2    1      0     1      B 57
3    1      1     1      C 48
4    0      0     0      D  3
5    1      1     0      E 12'

df <- read.table(text = txt, header=TRUE)

Solution

# DEFINE VECTOR OF VALUES 
vals <- c("blue", "yellow", "green")

# RESHAPE DATA LONG
rdf <- reshape(df, idvar = c("Letter", "ID"), 
               varying = vals, times = vals, 
               v.names = "C", timevar = "Col1", ids = NULL,
               new.row.names = 1:1E4, direction = "long")

# HELPER DF FOR ALL POSSIBLE COMBNS (AVOID REVERSE DUPLICATES)
col_df <- subset(expand.grid(Col1 = vals, Col2 = vals,
                             stringsAsFactors = FALSE),
                 Col1 < Col2)

# MERGE TWICE FOR EACH SET OF COLs
mdf <- merge(merge(rdf, col_df, by.x="Col1", by.y="Col1"), rdf, 
             by.x=c("Letter", "ID", "Col2"),
             by.y=c("Letter", "ID", "Col1"),
             suffixes = c(1, 2))

# RE-ORDER ROWS AND COLUMNS
mdf <- data.frame(with(mdf, mdf[order(Letter, ID), 
                                c("Letter", "ID", "Col1", "Col2", "C1", "C2")]), 
                  row.names = NULL)

Output

mdf

#    Letter ID  Col1   Col2 C1 C2
# 1       A 23  blue  green  0  1
# 2       A 23  blue yellow  0  0
# 3       A 23 green yellow  1  0
# 4       B 57  blue  green  1  1
# 5       B 57 green yellow  1  0
# 6       B 57  blue yellow  1  0
# 7       C 48  blue  green  1  1
# 8       C 48 green yellow  1  1
# 9       C 48  blue yellow  1  1
# 10      D  3  blue  green  0  0
# 11      D  3 green yellow  0  0
# 12      D  3  blue yellow  0  0
# 13      E 12  blue  green  1  0
# 14      E 12  blue yellow  1  1
# 15      E 12 green yellow  0  1

The likely bottleneck on very large data frames is base::reshape itself. Here is a faster replacement for that step using matrix manipulation (a quick timing check follows below):

matrix_melt <- function(df1, key, indName, valName) {
  # columns to stack (everything that is not a key column)
  value_cols <- names(df1)[!(names(df1) %in% key)]

  # column names repeated for each row, then flattened column-wise
  mat_inds <- matrix(matrix(value_cols, nrow = nrow(df1),
                            ncol = length(value_cols), byrow = TRUE), ncol = 1)
  # corresponding values, flattened in the same order
  mat_vals <- matrix(df1[value_cols], ncol = 1, byrow = TRUE)

  # key columns are recycled by data.frame() to match the stacked length
  df2 <- setNames(data.frame(df1[key], unlist(mat_inds), unlist(mat_vals),
                             row.names = NULL, stringsAsFactors = FALSE),
                  c(key, indName, valName))
  return(df2)
}

rdf <- matrix_melt(df, c("Letter", "ID"), "Col1", "C")
rdf
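To check whether reshape() really dominates the run time, a rough timing comparison on larger simulated data can be run. The sizes below are arbitrary assumptions chosen only for illustration, not taken from the original data:

# Simulated wide data (size chosen only for illustration)
set.seed(1)
n <- 1e5
big <- data.frame(blue   = rbinom(n, 1, 0.5),
                  yellow = rbinom(n, 1, 0.5),
                  green  = rbinom(n, 1, 0.5),
                  Letter = sample(LETTERS, n, replace = TRUE),
                  ID     = seq_len(n))

# base::reshape long
system.time(
  reshape(big, idvar = c("Letter", "ID"), varying = vals, times = vals,
          v.names = "C", timevar = "Col1", ids = NULL,
          new.row.names = seq_len(n * length(vals)), direction = "long")
)

# matrix-based melt
system.time(matrix_melt(big, c("Letter", "ID"), "Col1", "C"))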

Upvotes: 0

doubled

Reputation: 349

Here's a data.table solution using melt. It should work well unless the database is absolutely massive, in which case you can always split the work by id (see the sketch after the code below), but I'd guess this runs fast for your case.

library(data.table)
df = data.frame("blue" = c(0,1,1,0,1),
                "yellow" = c(0,0,1,0,1),
                "green" = c(1,1,1,0,0),
                "letter" = c("A","B","C","D","E"),
                "id" = c(23,57,48,3,12))

#convert to data.table and melt
setDT(df)

df = melt(df, id.vars = c("letter","id"))

#combine blue/yellow, blue/green, and yellow/green
df1 = merge(df[variable == "blue"],df[variable == "yellow"], by = c("letter","id"))
df2 = merge(df[variable == "blue"],df[variable == "green"], by = c("letter","id"))
df3 = merge(df[variable == "yellow"],df[variable == "green"], by = c("letter","id"))

df = rbindlist(list(df1,df2,df3))

#now fix names..
setnames(df, c("variable.x","value.x","variable.y","value.y"), c("col_1","c1","col_2","c2"))

#optionally rearrange cols...
df = df[,.(col_1,col_2,c1,c2,letter,id)]
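For reference, splitting by id could look roughly like this. This is an illustrative sketch, assuming the original wide data is in a data.table called wide and using a chunk size of 1000 ids; both names and sizes are assumptions, not from the original post:

# Wrap the melt + merge steps so they can be run per chunk of ids
pair_up <- function(dt) {
  long <- melt(dt, id.vars = c("letter", "id"))
  d1 <- merge(long[variable == "blue"],   long[variable == "yellow"], by = c("letter", "id"))
  d2 <- merge(long[variable == "blue"],   long[variable == "green"],  by = c("letter", "id"))
  d3 <- merge(long[variable == "yellow"], long[variable == "green"],  by = c("letter", "id"))
  rbindlist(list(d1, d2, d3))
}

ids    <- unique(wide$id)
groups <- split(ids, ceiling(seq_along(ids) / 1000))   # ~1000 ids per chunk
res    <- rbindlist(lapply(groups, function(g) pair_up(wide[id %in% g])))
setnames(res, c("variable.x", "value.x", "variable.y", "value.y"),
              c("col_1", "c1", "col_2", "c2"))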

Upvotes: 2
