Combining two dataframes to calculate variable when column names are equal to row values

Question

I have two dataframes: one dataframe that has dichotomous variables representing whether a string match is present, and a second dataframe that applies "weights" for that string across different dimensions.

For example, df1 might look like this:

organic    gluten_free    kosher   sugar_free
1          0              0        0
1          1              0        1
1          1              0        1
0          0              1        0
1          0              1        0

The row values in the second dataframe (df2) should match the values of the column names in df1, and with the value in each row representing a weight.

attribute    eco-friendly     healthy 
organic      2                3         
gluten_free  1                4         
kosher       3                3         
sugar_free   2                3

I then want to calculate the product of the weight for each row value in df1 when the colname in df1 is equal to the row value in df2 into separate indices. For clarity, I have included the specific calculations for the eco-friendly index below:

organic    gluten_free    kosher   sugar-free  eco-friendly
1          0              0        0           (1*2 + 0*1 + 0*3 + 0*2)
1          1              0        1           (1*2 + 1*1 + 0*3 + 1*2)
1          1              0        1           (1*2 + 1*1 + 0*3 + 1*2)
0          0              1        0           (0*2 + 0*1 + 1*3 + 0*2)
1          0              1        0           (1*2 + 0*1 + 1*3 + 0*2)

I have written a very ugly and slow loop function to complete this task, but am confident a more elegant solution exists. Some additional sample data is below.

> dput(df1[1:100,]) 
structure(list(organic = c("0", "0", "0", "0", "0", "0", "1",  "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",  "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",  "0", "0", "0", "0", "1", "1", "0", "0", "0", "0", "0", "0", "0",  "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",  "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",  "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",  "0", "0"), gluten_free = c("0", "1", "0", "0", "1", "0", "0",  "0", "0", "0", "0", "0", "1", "0", "0", "1", "1", "1", "1", "1",  "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "0", "0", "0",  "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",  "0", "0", "0", "0", "1", "1", "0", "1", "1", "1", "1", "1", "1",  "0", "0", "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0",  "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "0",  "0", "0", "0", "0", "1", "1", "0", "0", "0", "0", "0", "1", "0",  "1", "0"), kosher = c("0", "0", "0", "0", "0", "0", "0", "0",  "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", "0", "0", "1",  "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",  "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",  "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",  "1", "1", "1", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",  "1", "0", "1", "1", "1", "0", "1", "1", "0", "1", "1", "1", "1",  "1", "1", "1", "0", "0", "1", "1", "0", "0", "1", "0", "0", "0",  "1")), row.names = c("2", "3", "4", "5", "6", "7", "8", "9",  "10", "11", "12", "15", "17", "18", "19", "22", "23", "24", "25",  "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36",  "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47",  "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58",  "59", "60", "61", "62", "63", "64", "65", "66", "67", "68", "69",  "70", "71", "72", "73", "74", "77", "78", "79", "80", "81", "83",  "84", "85", "86", "87", "88", "91", "92", "93", "95", "97", "98",  "101", "103", "105", "106", "108", "117", "124", "125", "127",  "129", "131", "132", "133", "136", "137"), class = "data.frame")

> dput(df2[1:3,]) 
structure(list(attribute = c("organic", "gluten_free", "kosher"), eco_friendly = c(1L, 3L, 2L), healthy = c(2L, 0L, 1L)), row.names = 1:3, class = "data.frame")

acylam · Accepted Answer

We can take the dot product of the two dfs, keeping in mind that the %*% operator only works for numeric matrices:

df1[] <- lapply(df1, as.numeric)
output <- cbind(df1, as.matrix(df1) %*% as.matrix(df2[,-1]))

The downside to this is that the columns in df1 and the rows in df2 have to be in the correct order. To make sure the column and row order matches, we can use the following in place of df2[,-1]:

df2[match(names(df1), df2$attribute),-1]

Output:

> head(output)
  organic gluten_free kosher eco_friendly healthy
2       0           0      0            0       0
3       0           1      0            3       0
4       0           0      0            0       0
5       0           0      0            0       0
6       0           1      0            3       0
7       0           0      0            0       0

Combining two dataframes to calculate variable when column names are equal to row values

Answers (2)

Related Questions