Peter.2055
Peter.2055

Reputation: 7

Multiplying and summarising values from two data frames R

I have two data frames, df1 and df2, an example their structure is shown below:

df1:

   ID          A1     A2         A3        A4      A5
1   1 0.000000000 0.0000 0.00000000 0.0000000 0.00000
2   2 0.000000000 0.0000 0.00000000 0.0000000 0.00000
3   3 0.000000000 0.0000 0.00000000 0.0000000 0.00000
4   4 0.001775742 0.0000 0.00000000 0.0000000 0.00000
5   5 0.201775742 0.0000 0.00000000 0.0000000 0.00000
6   6 0.000000000 0.0367 0.00000000 0.0000000 0.00000
7   7 0.000000000 0.0369 0.00000000 0.0000000 0.00000
8   8 0.000000000 0.0371 0.00000000 0.0000000 0.00000
9   9 0.000000000 0.0000 0.00000000 0.0000000 0.00000
10 10 0.000000000 0.0000 0.01448181 0.0000000 0.00000
11 11 0.000000000 0.0000 0.01648181 0.0000000 0.00000
12 12 0.000000000 0.0000 0.00000000 0.0000000 0.00000
13 13 0.000000000 0.0000 0.00000000 0.0012056 0.00000
14 14 0.000000000 0.0000 0.00000000 0.0012036 0.00000
15 15 0.000000000 0.0000 0.00000000 0.0012016 0.00000
16 16 0.000000000 0.0000 0.00000000 0.0000000 0.00000
17 17 0.000000000 0.0000 0.00000000 0.0000000 0.00000
18 18 0.000000000 0.0000 0.00000000 0.0000000 0.04976
19 19 0.000000000 0.0000 0.00000000 0.0000000 0.04978

df2:

   ID    C1     C2    C3
1   1 0.020 0.3540 0.258
2   2 0.030 0.3542 0.273
3   3 0.002 0.3544 0.288
4   4 0.022 0.3546 0.303
5   5 0.045 0.3548 0.318
6   6 0.046 0.3550 0.333
7   7 0.047 0.3552 0.348
8   8 0.048 0.3554 0.363
9   9 0.049 0.3556 0.378
10 10 0.050 0.3558 0.393
11 11 0.051 0.3560 0.408
12 12 0.052 0.3562 0.423
13 13 0.053 0.3564 0.438
14 14 0.054 0.3566 0.453
15 15 0.055 0.3568 0.468
16 16 0.056 0.3570 0.483
17 17 0.057 0.3572 0.498
18 18 0.058 0.3574 0.513
19 19 0.059 0.3576 0.528

I would like to be able to perform some calculations using both data frames, but unsure on how to accomplish this.

Firstly, I need to multiply column C1 in df2 with columns A1:A5 in df1, such that when there is a match on the ID in both data frames, there is a value for the multiplication, e.g., ID 4 and 5. The output would be something like this:

df3:

   ID        C1xA1     C1xA2       C1xA3       C1xA4      C1xA5
1   1 0.0000000000 0.0000000 0.000000000 0.00000e+00 0.00000000
2   2 0.0000000000 0.0000000 0.000000000 0.00000e+00 0.00000000
3   3 0.0000000000 0.0000000 0.000000000 0.00000e+00 0.00000000
4   4 0.0000390663 0.0000000 0.000000000 0.00000e+00 0.00000000
5   5 0.0090799080 0.0000000 0.000000000 0.00000e+00 0.00000000
6   6 0.0000000000 0.0016882 0.000000000 0.00000e+00 0.00000000
7   7 0.0000000000 0.0017343 0.000000000 0.00000e+00 0.00000000
8   8 0.0000000000 0.0017808 0.000000000 0.00000e+00 0.00000000
9   9 0.0000000000 0.0000000 0.000000000 0.00000e+00 0.00000000
10 10 0.0000000000 0.0000000 0.005152628 0.00000e+00 0.00000000
11 11 0.0000000000 0.0000000 0.005867524 0.00000e+00 0.00000000
12 12 0.0000000000 0.0000000 0.000000000 0.00000e+00 0.00000000
13 13 0.0000000000 0.0000000 0.000000000 6.38968e-05 0.00000000
14 14 0.0000000000 0.0000000 0.000000000 6.49944e-05 0.00000000
15 15 0.0000000000 0.0000000 0.000000000 6.60880e-05 0.00000000
16 16 0.0000000000 0.0000000 0.000000000 0.00000e+00 0.00000000
17 17 0.0000000000 0.0000000 0.000000000 0.00000e+00 0.00000000
18 18 0.0000000000 0.0000000 0.000000000 0.00000e+00 0.00288608
19 19 0.0000000000 0.0000000 0.000000000 0.00000e+00 0.00293702

Then, I need to sum each of the cols in df3, e.g., C1xA1, and then divide these by the sum of the ID col in df1 - leaving me with 5 values. I then need to apply the same method with columns C2 and C3 from df2. The output would be something like this:

df4:

  sum_C1xA.sum_A.ID sum_C2xA.sum_A.ID sum_C3xA.sum_A.ID
1       4.79946e-05       3.80104e-04       3.40541e-04
2       2.73858e-05       2.06951e-04       2.02787e-04
3       5.80008e-05       5.80008e-05       6.53470e-05
4       1.02621e-06       6.77690e-06       8.60859e-06
5       3.06479e-05       1.87292e-04       2.72688e-04

I do not need the output from df3, its only shown as an example of what I am trying to explain.

I have managed to compute this in excel, but this is becoming very repetitive and I cant work out how to implement this in R. Any help would be appreciated. Below is the structure of the data frames mentioned above.

df1<-structure(list(ID = 1:19,
                    A1 = c(0, 0, 0, 0.001775742, 0.201775742,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                    A2 = c(0, 0, 0, 0,0, 0.0367, 0.0369, 0.0371, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
                    A3 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01448181, 0.01648181,0, 0, 0, 0, 0, 0, 0, 0),
                    A4 = c(0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0.0012056, 0.0012036, 0.0012016, 0, 0, 0, 0),
                    A5 = c(0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.04976,0.04978)),
               class = "data.frame", row.names = c(NA, -19L))

df2<-structure(list(ID = 1:19,
                    C1 = c(0.02, 0.03, 0.002, 0.022,0.045, 0.046, 0.047, 0.048, 0.049, 0.05, 0.051, 0.052, 0.053,0.054, 0.055, 0.056, 0.057, 0.058, 0.059),
                    C2 = c(0.354, 0.3542,0.3544, 0.3546, 0.3548, 0.355, 0.3552, 0.3554, 0.3556, 0.3558,0.356, 0.3562, 0.3564, 0.3566, 0.3568, 0.357, 0.3572, 0.3574,0.3576),
                    C3 = c(0.258, 0.273, 0.288, 0.303, 0.318, 0.333, 0.348,0.363, 0.378, 0.393, 0.408, 0.423, 0.438, 0.453, 0.468, 0.483,0.498, 0.513, 0.528)),
               class = "data.frame", row.names = c(NA,-19L))

df3<-structure(list(ID = 1:19, C1xA1 = c(0, 0, 0, 3.90663e-05, 0.009079908,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                    C1xA2 = c(0, 0, 0,0, 0, 0.0016882, 0.0017343, 0.0017808, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0),
                    C1xA3 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0.005152628, 0.005867524,0, 0, 0, 0, 0, 0, 0, 0),
                    C1xA4 = c(0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 6.38968e-05, 6.49944e-05, 6.6088e-05, 0, 0, 0, 0),
                    C1xA5 = c(0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00288608, 0.00293702)),
               class = "data.frame", row.names = c(NA, -19L))

df4<-structure(list(sum_C1xA.sum_A.ID = c(4.79946e-05, 2.73858e-05,5.80008e-05, 1.02621e-06, 3.06479e-05),
                    sum_C2xA.sum_A.ID = c(0.000380104,0.000206951, 5.80008e-05, 6.7769e-06, 0.000187292),
                    sum_C3xA.sum_A.ID = c(0.000340541,0.000202787, 6.5347e-05, 8.60859e-06, 0.000272688)),
               class = "data.frame", row.names = c(NA,-5L))

Upvotes: 0

Views: 54

Answers (2)

jkatam
jkatam

Reputation: 3447

Alternatively

df1 %>% inner_join(df2, by='ID') %>% 
  summarise(sum=sum(ID),
            across(c(A1:A5), ~sum(.x*C1)/sum, .names = 'C1_{col}'),
            across(c(A1:A5), ~sum(.x*C2)/sum, .names = 'C2_{col}'),
            across(c(A1:A5), ~sum(.x*C3)/sum, .names = 'C3_{col}')) %>% 
  pivot_longer(cols = starts_with('C'),names_to = 'name', values_to = 'value', values_transform = as.numeric) %>% 
  separate(col=name,into=c('C','A'),sep = '\\_') %>% 
  pivot_wider(id_cols = A, names_from = C, values_from = value)

Created on 2023-07-04 with reprex v2.0.2

# A tibble: 5 × 4
  A             C1         C2         C3
  <chr>      <dbl>      <dbl>      <dbl>
1 A1    0.0000480  0.000380   0.000341  
2 A2    0.0000274  0.000207   0.000203  
3 A3    0.00000824 0.0000580  0.0000653 
4 A4    0.00000103 0.00000678 0.00000861
5 A5    0.0000306  0.000187   0.000273  

Upvotes: 0

brinyprawn
brinyprawn

Reputation: 450

The process you are describing is matrix multiplication, and it can be done in R using %*%

For example:

A <- as.matrix(df1[,-1]) # remove the id
B <- as.matrix(df2[,-1]) # remove the id

t(A) %*% B

Which results in your desired output without the scaling by the constant factor. The sum of 1:19 is 19*10, so your output can be done with this:

> t(A) %*% B/(19*10)

# Output
#              C1           C2           C3
# A1 4.799460e-05 3.801037e-04 3.405407e-04
# A2 2.738579e-05 2.069512e-04 2.027874e-04
# A3 8.235067e-06 5.800080e-05 6.534700e-05
# A4 1.026206e-06 6.776897e-06 8.608592e-06
# A5 3.064789e-05 1.872924e-04 2.726880e-04

Upvotes: 2

Related Questions