Reputation: 1218
I have two data frames that look like this :
library(tidyverse)
date = c(rep(as.Date("2022-01-01"),4),rep(as.Date("2022-02-01"),4),rep(as.Date("2022-03-01"),3))
A = seq(1,11,1)
B = seq(12,22,1)
C = seq(23,33,1)
symbol = c(rep("A",4),rep("B,C",4),rep("D,A",3))
df1 = tibble(date,A,B,C,symbol);df1
# A tibble: 11 × 5
date A B C symbol
<date> <dbl> <dbl> <dbl> <chr>
1 2022-01-01 1 12 23 A
2 2022-01-01 2 13 24 A
3 2022-01-01 3 14 25 A
4 2022-01-01 4 15 26 A
5 2022-02-01 5 16 27 B,C
6 2022-02-01 6 17 28 B,C
7 2022-02-01 7 18 29 B,C
8 2022-02-01 8 19 30 B,C
9 2022-03-01 9 20 31 D,A
10 2022-03-01 10 21 32 D,A
11 2022-03-01 11 22 33 D,A
and the second one :
first = c("A","A","B","B","B")
second = c("D","F","A","C","D")
value = c(2,3,4,3,5)
df2 = tibble(first,second,value);df2
# A tibble: 5 × 3
first second value
<chr> <chr> <dbl>
1 A D 2
2 A F 3
3 B A 4
4 B C 3
5 B D 5
I want from the first data frame to search the values of columns symbol in the first and second columns of data frame 2 (df2).If symbol in df1 has no pair to take the maximum value of columns A,B,C submatrix in df1 which is 26.
If there is a match for a pair, for example B,C in df1 and df2 to take the minimum of row corresponding to value of df2 in df1.In my example B,C in column symbol exist in df2 and has the value 3 therefore in the third row of sub matrix corresponding to B,C in df1 is 7,18,29 and the minimum is 7. The for D,A I have to take the minimum in row 2 in the D,A sub matrix in df1 which is 10,21,32 with minimum 10.
finally I want ideally the reported data frame to be like this:
symbol | calc |
---|---|
A | 26 |
B,C | 7 |
D,A | 10 |
How can I do this in R (using dplyr) ?
Upvotes: 3
Views: 495
Reputation: 102241
We can use igraph
to construct a graph with edges listed in df2
and then find the matched rows/values in df1
in turn
library(igraph)
library(data.table)
g <- graph_from_data_frame(df2, directed = FALSE)
setDT(df1)[
,
.(calc = sapply(strsplit(symbol, ","), function(x) {
if (length(x) == 1 || get.edge.ids(g, x) == 0) {
max(unlist(.SD))
} else {
min(.SD[df2$value[get.edge.ids(g, x)]])
}
})),
symbol,
.SDcols = c("A", "B", "C")
]
and you will see
symbol calc
1: A 26
2: B,C 7
3: D,A 10
Upvotes: 1
Reputation: 79288
df1 %>%
separate(symbol, c("first", "second"), fill = "right", remove = FALSE)%>%
left_join(rbind(df2, transform(df2, first = second, second = first)))%>%
group_by(symbol)%>%
summarise(calc = if(is.na(value[1])) max(c_across(A:C))
else pmin(c_across(A:C))[value[1]])
# A tibble: 3 × 2
symbol calc
<chr> <dbl>
1 A 26
2 B,C 7
3 D,A 10
Upvotes: 5
Reputation: 5673
Here is a solution
library(dplyr)
df1 %>%
group_by(symbol) %>%
summarise({
tmp <- str_split(symbol[1],",")[[1]]
if(length(tmp) == 1){
max(A,B,C)
}else{
i <- df2 %>%
filter((first == tmp[1] & second == tmp[2]) | (second == tmp[1] & first == tmp[2])) %>%
pull(value)
min(A[i],B[i],C[i])}
})
# A tibble: 3 x 2
symbol `{ ... }`
<chr> <dbl>
1 A 26
2 B,C 7
3 D,A 10
I actually implemented what you describe. I group per symbol, so I work in each "submatrix" as you call them
I first split you symbol:
tmp <- str_split(symbol[1],",")[[1]]
If there is only one symbol, I take the maximum of A, B, C:
if(length(tmp) == 1){
max(A,B,C)
}
If not, I find in df2 the line value, and take the minimum of A, B, and C of this line:
else{
i <- df2 %>%
filter((first == tmp[1] & second == tmp[2]) | (second == tmp[1] & first == tmp[2])) %>%
pull(value)
min(A[i],B[i],C[i])}
Upvotes: 2
Reputation: 1
df1 %>%
left_join(
df2 %>%
mutate(symbol = paste0(first, “,”, second),
by = “symbol”)
) %>%
mutate(calc = coalesce(value, pmax(A, B, C)) %>%
select(symbol, value) %>%
group_by(symbol) %>%
filter(row_number() == 1)
Upvotes: -1