Merge information between two df in R

Question

I would need help in order to add a new column into this table:

 > table1
       parent node        blgth label
    1      19    1 0.1357120409  SP_A
    2      19    2 0.0794631914  SP_B
    3      18    3 0.0861335837  SP_C
    4      17    4 0.6693699494  SP_D
    5      21    5 0.9480669221  SP_E
    6      21    6 1.1270609760  SP_F
    7      20    7 0.4409983357  SP_G
    8      20    8 0.4412279821  SP_H
    9      16    9 1.1449969710  SP_I
    10     22   10 1.0935878043  SP_J
    11     23   11 0.0000025842  SP_K
    12     23   12 0.0053554342  SP_L
    13     13   13           NA  Root
    14     13   14 0.4741914879  71.5
    15     14   15 0.0606263444  87.6
    16     15   16 0.1427482127  40.4
    17     16   17 0.0836374533   100
    18     17   18 0.4057476840  99.8
    19     18   19 0.0979910062  87.7
    20     15   20 0.4931972115   100
    21     14   21 0.0957999921  73.1
    22     13   22 0.7289752119   100
    23     22   23 0.6091491479   100

The idea is to parse another table:

> table2
   label  VALUE Event
1   SP_E     NA     1
2   SP_G 0.0026     4
3   SP_H 0.0026     4
4   SP_I     NA     3
5   SP_F     NA     2
6   SP_A 0.0970     5
7   SP_B 0.0970     5
8   SP_C 0.0970     5
9   SP_E     NA     1
10  SP_G 0.0026     4
11  SP_H 0.0026     4
12  SP_F     NA     2
13  SP_A 0.0970     5
14  SP_B 0.0970     5
15  SP_C 0.0970     5
16  SP_E     NA     1
17  SP_G 0.0026     4
18  SP_H 0.0026     4
19  SP_F     NA     2
20  SP_A 0.0970     5
21  SP_B 0.0970     5
22  SP_C 0.0970     5
23  SP_E     NA     1
24  SP_G 0.0026     4
25  SP_H 0.0026     4
26  SP_F     NA     2
27  SP_A 0.0970     5
28  SP_B 0.0970     5
29  SP_C 0.0970     5

The idea is to first within the table2 groupby Event in order to see which label are in the same Event:

for instance:

SP_A and SP_B are in the Event 5

their VALUE is 0.0970, and the idea is to check in the table1 the parent numbers associated with this 3 SP labels: here the parent values are 19 and 18, then I simply add the VALUE (0.0970) to the lowest node number (here 18) into a new column VALUE in table1:

parent  node blgth      label  VALUE
19  1   0.1357120409    SP_A 
19  2   0.0794631914    SP_B
18  3   0.0861335837    SP_C
17  4   0.6693699494    SP_D
21  5   0.9480669221    SP_E
21  6   1.1270609760    SP_F
20  7   0.4409983357    SP_G
20  8   0.4412279821    SP_H
16  9   1.1449969710    SP_I
22  10  1.0935878043    SP_J
23  11  0.0000025842    SP_K
23  12  0.0053554342    SP_L
13  13  NA  Root
13  14  0.4741914879    71.5
14  15  0.0606263444    87.6
15  16  0.1427482127    40.4
16  17  0.0836374533    100
17  18  0.4057476840    99.8  0.0970
18  19  0.0979910062    87.7
15  20  0.4931972115    100
14  21  0.0957999921    73.1
13  22  0.7289752119    100
22  23  0.6091491479    100

then I do the same for Event 4:

SP_G and SP_H are in the Event 4

their VALUE is 0.0026, the parent value in table1 is 20, then I add the VALUE 0.0026 to the corresponding node number 20 into a new column VALUE in table1:

parent  node blgth      label  VALUE
19  1   0.1357120409    SP_A 
19  2   0.0794631914    SP_B
18  3   0.0861335837    SP_C
17  4   0.6693699494    SP_D
21  5   0.9480669221    SP_E
21  6   1.1270609760    SP_F
20  7   0.4409983357    SP_G
20  8   0.4412279821    SP_H
16  9   1.1449969710    SP_I
22  10  1.0935878043    SP_J
23  11  0.0000025842    SP_K
23  12  0.0053554342    SP_L
13  13  NA  Root
13  14  0.4741914879    71.5
14  15  0.0606263444    87.6
15  16  0.1427482127    40.4
16  17  0.0836374533    100
17  18  0.4057476840    99.8  0.0970
18  19  0.0979910062    87.7  
15  20  0.4931972115    100   0.0026
14  21  0.0957999921    73.1
13  22  0.7289752119    100
22  23  0.6091491479    100

If I do the process for the other one (they are lonely within event), I will simply add NA values (because their VALUE is NA in table2 anyway)

here is the expected final result:

parent  node blgth      label  Mean_dNdS
19  1   0.1357120409    SP_A 
19  2   0.0794631914    SP_B
18  3   0.0861335837    SP_C
17  4   0.6693699494    SP_D
21  5   0.9480669221    SP_E
21  6   1.1270609760    SP_F
20  7   0.4409983357    SP_G
20  8   0.4412279821    SP_H
16  9   1.1449969710    SP_I
22  10  1.0935878043    SP_J
23  11  0.0000025842    SP_K
23  12  0.0053554342    SP_L
13  13  NA  Root
13  14  0.4741914879    71.5  NA
14  15  0.0606263444    87.6  NA
15  16  0.1427482127    40.4  NA
16  17  0.0836374533    100   NA
17  18  0.4057476840    99.8  0.0970
18  19  0.0979910062    87.7  
15  20  0.4931972115    100   0.0026
14  21  0.0957999921    73.1  NA
13  22  0.7289752119    100   NA
22  23  0.6091491479    100   NA

Does someone would have and idea? I guess dplyr or a tydiverse method could be useful since it include groupby loop?

data:

table1

structure(list(parent = c(19L, 19L, 18L, 17L, 21L, 21L, 20L, 
20L, 16L, 22L, 23L, 23L, 13L, 13L, 14L, 15L, 16L, 17L, 18L, 15L, 
14L, 13L, 22L), node = 1:23, blgth = c(0.1357120409, 0.0794631914, 
0.0861335837, 0.6693699494, 0.9480669221, 1.127060976, 0.4409983357, 
0.4412279821, 1.144996971, 1.0935878043, 2.5842e-06, 0.0053554342, 
NA, 0.4741914879, 0.0606263444, 0.1427482127, 0.0836374533, 0.405747684, 
0.0979910062, 0.4931972115, 0.0957999921, 0.7289752119, 0.6091491479
), label = structure(c(9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 
17L, 18L, 19L, 20L, 8L, 3L, 5L, 2L, 1L, 7L, 6L, 1L, 4L, 1L, 1L
), .Label = c("100", "40.4", "71.5", "73.1", "87.6", "87.7", 
"99.8", "Root", "SP_A", "SP_B", "SP_C", "SP_D", "SP_E", "SP_F", 
"SP_G", "SP_H", "SP_I", "SP_J", "SP_K", "SP_L"), class = "factor")), class = "data.frame", row.names = c(NA, 
-23L))

table2

structure(list(label = structure(c(4L, 6L, 7L, 8L, 5L, 1L, 2L, 
3L, 4L, 6L, 7L, 5L, 1L, 2L, 3L, 4L, 6L, 7L, 5L, 1L, 2L, 3L, 4L, 
6L, 7L, 5L, 1L, 2L, 3L), .Label = c("SP_A", "SP_B", "SP_C", "SP_E", 
"SP_F", "SP_G", "SP_H", "SP_I"), class = "factor"), VALUE = c(NA, 
0.0026, 0.0026, NA, NA, 0.097, 0.097, 0.097, NA, 0.0026, 0.0026, 
NA, 0.097, 0.097, 0.097, NA, 0.0026, 0.0026, NA, 0.097, 0.097, 
0.097, NA, 0.0026, 0.0026, NA, 0.097, 0.097, 0.097), Event = c(1L, 
4L, 4L, 3L, 2L, 5L, 5L, 5L, 1L, 4L, 4L, 2L, 5L, 5L, 5L, 1L, 4L, 
4L, 2L, 5L, 5L, 5L, 1L, 4L, 4L, 2L, 5L, 5L, 5L)), class = "data.frame", row.names = c(NA, 
-29L))

Bas · Accepted Answer

I think this is what you want.
We join table1 to table2 by label to be able to use the parent column. Then, we group by Event and take the mean VALUE (since it is always the same we could've taken min or first as well), and take the smallest value of parent.

library(tidyverse)

table3 <- table2 %>% 
  left_join(table1, by = "label") %>% 
  group_by(Event) %>% 
  summarize(Mean_dNdS = mean(VALUE),
            parent = min(parent)) %>% 
  select(parent, Mean_dNdS)

This gives

 parent Mean_dNdS
       
1    21   NA     
2    21   NA     
3    16   NA     
4    20    0.0026
5    18    0.097

Then we can simply join this table to table1 and get the result you want.

table1 %>% 
  left_join(table3, by = c("node" = "parent"))

giving

   parent node        blgth label Mean_dNdS
1      19    1 0.1357120409  SP_A        NA
2      19    2 0.0794631914  SP_B        NA
3      18    3 0.0861335837  SP_C        NA
4      17    4 0.6693699494  SP_D        NA
5      21    5 0.9480669221  SP_E        NA
6      21    6 1.1270609760  SP_F        NA
7      20    7 0.4409983357  SP_G        NA
8      20    8 0.4412279821  SP_H        NA
9      16    9 1.1449969710  SP_I        NA
10     22   10 1.0935878043  SP_J        NA
11     23   11 0.0000025842  SP_K        NA
12     23   12 0.0053554342  SP_L        NA
13     13   13           NA  Root        NA
14     13   14 0.4741914879  71.5        NA
15     14   15 0.0606263444  87.6        NA
16     15   16 0.1427482127  40.4        NA
17     16   17 0.0836374533   100        NA
18     17   18 0.4057476840  99.8    0.0970
19     18   19 0.0979910062  87.7        NA
20     15   20 0.4931972115   100    0.0026
21     14   21 0.0957999921  73.1        NA
22     14   21 0.0957999921  73.1        NA
23     13   22 0.7289752119   100        NA
24     22   23 0.6091491479   100        NA

Merge information between two df in R

Answers (1)

Related Questions