Homer Jay Simpson
Homer Jay Simpson

Reputation: 1218

How can I match characters values of two data frames and apply a function corresponding to this match in R?

I have two data frames that look like this :


library(tidyverse)
date = c(rep(as.Date("2022-01-01"),4),rep(as.Date("2022-02-01"),4),rep(as.Date("2022-03-01"),3))
A = seq(1,11,1)
B = seq(12,22,1)
C = seq(23,33,1)
symbol = c(rep("A",4),rep("B,C",4),rep("D,A",3))
df1 = tibble(date,A,B,C,symbol);df1
# A tibble: 11 × 5
   date           A     B     C symbol
   <date>     <dbl> <dbl> <dbl> <chr> 
 1 2022-01-01     1    12    23 A     
 2 2022-01-01     2    13    24 A     
 3 2022-01-01     3    14    25 A     
 4 2022-01-01     4    15    26 A     
 5 2022-02-01     5    16    27 B,C   
 6 2022-02-01     6    17    28 B,C   
 7 2022-02-01     7    18    29 B,C   
 8 2022-02-01     8    19    30 B,C   
 9 2022-03-01     9    20    31 D,A   
10 2022-03-01    10    21    32 D,A   
11 2022-03-01    11    22    33 D,A   

and the second one :


first = c("A","A","B","B","B")
second = c("D","F","A","C","D")
value = c(2,3,4,3,5)
df2 = tibble(first,second,value);df2
# A tibble: 5 × 3
  first second value
  <chr> <chr>  <dbl>
1 A     D          2
2 A     F          3
3 B     A          4
4 B     C          3
5 B     D          5

I want from the first data frame to search the values of columns symbol in the first and second columns of data frame 2 (df2).If symbol in df1 has no pair to take the maximum value of columns A,B,C submatrix in df1 which is 26.

If there is a match for a pair, for example B,C in df1 and df2 to take the minimum of row corresponding to value of df2 in df1.In my example B,C in column symbol exist in df2 and has the value 3 therefore in the third row of sub matrix corresponding to B,C in df1 is 7,18,29 and the minimum is 7. The for D,A I have to take the minimum in row 2 in the D,A sub matrix in df1 which is 10,21,32 with minimum 10.

finally I want ideally the reported data frame to be like this:

symbol calc
A 26
B,C 7
D,A 10

How can I do this in R (using dplyr) ?

Upvotes: 3

Views: 495

Answers (4)

ThomasIsCoding
ThomasIsCoding

Reputation: 102241

We can use igraph to construct a graph with edges listed in df2 and then find the matched rows/values in df1 in turn

library(igraph)
library(data.table)

g <- graph_from_data_frame(df2, directed = FALSE)

setDT(df1)[
  ,
  .(calc = sapply(strsplit(symbol, ","), function(x) {
    if (length(x) == 1 || get.edge.ids(g, x) == 0) {
      max(unlist(.SD))
    } else {
      min(.SD[df2$value[get.edge.ids(g, x)]])
    }
  })),
  symbol,
  .SDcols = c("A", "B", "C")
]

and you will see

   symbol calc
1:      A   26
2:    B,C    7
3:    D,A   10

Upvotes: 1

Onyambu
Onyambu

Reputation: 79288

df1 %>%
  separate(symbol, c("first", "second"), fill = "right", remove = FALSE)%>%
  left_join(rbind(df2, transform(df2, first = second, second = first)))%>%
  group_by(symbol)%>%
  summarise(calc = if(is.na(value[1])) max(c_across(A:C)) 
              else pmin(c_across(A:C))[value[1]])
  
# A tibble: 3 × 2
  symbol  calc
  <chr>  <dbl>
1 A         26
2 B,C        7
3 D,A       10

Upvotes: 5

denis
denis

Reputation: 5673

Here is a solution

library(dplyr)

df1 %>%
  group_by(symbol) %>%
  summarise({
    tmp <- str_split(symbol[1],",")[[1]]
    if(length(tmp) == 1){
      max(A,B,C)
    }else{
      i <- df2 %>%
        filter((first == tmp[1] & second == tmp[2]) | (second == tmp[1] & first == tmp[2])) %>%
        pull(value)
      min(A[i],B[i],C[i])}
  })


# A tibble: 3 x 2
  symbol `{ ... }`
  <chr>      <dbl>
1 A             26
2 B,C            7
3 D,A           10

I actually implemented what you describe. I group per symbol, so I work in each "submatrix" as you call them

I first split you symbol:

tmp <- str_split(symbol[1],",")[[1]]

If there is only one symbol, I take the maximum of A, B, C:

if(length(tmp) == 1){
      max(A,B,C)
    }

If not, I find in df2 the line value, and take the minimum of A, B, and C of this line:

else{
      i <- df2 %>%
        filter((first == tmp[1] & second == tmp[2]) | (second == tmp[1] & first == tmp[2])) %>%
        pull(value)
      min(A[i],B[i],C[i])}

Upvotes: 2

sasfan_1924
sasfan_1924

Reputation: 1

df1 %>% 
  left_join(
    df2 %>% 
      mutate(symbol = paste0(first, “,”, second), 
      by = “symbol”)
    ) %>% 
  mutate(calc = coalesce(value, pmax(A, B, C)) %>%
  select(symbol, value) %>%
  group_by(symbol) %>%
  filter(row_number() == 1)

Upvotes: -1

Related Questions