Reputation: 1245
I have a following problme. I have two dataframes. In the second one. there are conditions about how a new column in the first dataframe should be calculated. See example bellow: First df:
df1 <- data.frame(country = c("01", "01", "02", "03", "03", "03" , "04", "05"),
date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
blabla = c(23, 41, 32, 8, 50, 27, 8, 7)
)
Second df:
df2 <- data.frame( country = c("01", "02", "03", "04", "05" ),
match_country1 = c("02", "03", "01", "01", "01"),
match_country2 = c("03", "04", "02", "02", "03"),
match_country3 = c("05", "05", "04", "03", "04")
)
Now I need to compute a new_value that is an average of three values as defined in df2. I need to respect a date in df1. For example, new_value for country "01" and date "2020-01-01" is an average of a value from country "02", country "03", country "05" all from date "2020-01-01".
Desired output is below:
new_df <- data.frame(country = c("01", "01", "02", "03", "03", "03" , "04", "05"),
date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
blabla = c(23, 41, 32, 8, 50, 27, 8, 7),
new_value = c(NA, #because no data for 02, 03, 05 from 2020-01-01
(2-3+15)/3,
(-3+15)/2, #because no data for 04 from 2020-01-02
(3+2)/2, #because no data for 04 from 2020-01-02
NA, #because no data for 01, 02, 04 from 2020-01-03
NA, #because no data for 01, 02, 04 from 2020-01-04
4, #because no data for 02, 03 from 2020-01-01
(3-3)/2 #because no data for 04 from 2020-01-02
)
)
How can I do this, please?
Upvotes: 1
Views: 76
Reputation: 269754
This can be done using an SQL triple join. For each row in df1 get the matching country row in df2 via left join and then get all the rows in the b instance of df1 for which the date is the same and there is a country match in df2. Then take the average b value in the matching rows.
library(sqldf)
sqldf("select a.*, avg(b.value) new_value
from df1 a
left join df2 c on a.country = c.country
left join df1 b on a.date = b.date and
b.country in (c.match_country1, c.match_country2, c.match_country3)
group by a.rowid")
giving this data frame:
country date value blabla new_value
1 01 2020-01-01 4.0 23 NA
2 01 2020-01-02 3.0 41 4.666667
3 02 2020-01-02 2.0 32 6.000000
4 03 2020-01-02 -3.0 8 2.500000
5 03 2020-01-03 1.5 50 NA
6 03 2020-01-04 12.0 27 NA
7 04 2020-01-01 10.0 8 4.000000
8 05 2020-01-02 15.0 7 0.000000
Here are two variations. The first generates the in (...)
string as matches
and substitutes it in and the second converts df2
to long form, long
first.
matches <- toString(names(df2)[-1])
fn$sqldf("select a.*, avg(b.value) new_value
from df1 a
left join df2 c on a.country = c.country
left join df1 b on a.date = b.date and b.country in ($matches)
group by a.rowid")
varying <- list(match_country = names(df2)[-1])
long <- reshape(df2, dir = "long", varying = varying, v.names = names(varying))
sqldf("select a.*, avg(b.value) new_value
from df1 a
left join long c on a.country = c.country
left join df1 b on a.date = b.date and b.country = c.match_country
group by a.rowid")
Upvotes: 2
Reputation: 76470
Though there already is an accepted answer, here is a base R, since the two answers posted (2nd) require external packages.
df1$new_value <- with(df1, ave(seq_len(n), date, FUN = function(i){
mrg <- merge(df1[i, ], df2)
j <- grep("^match", names(mrg))
ctry <- unique(df1[i, "country"])
apply(mrg[j], 1, function(row){
k <- match(row, ctry)
if(any(!is.na(k)))
mean(mrg[k, "value"], na.rm = TRUE)
else NA_real_
})
}))
identical(df1$new_value, new_df$new_value)
#[1] TRUE
Upvotes: 1
Reputation: 26218
This tidyverse approach may help
df1
#> country date value blabla
#> 1 01 2020-01-01 4.0 23
#> 2 01 2020-01-02 3.0 41
#> 3 02 2020-01-02 2.0 32
#> 4 03 2020-01-02 -3.0 8
#> 5 03 2020-01-03 1.5 50
#> 6 03 2020-01-04 12.0 27
#> 7 04 2020-01-01 10.0 8
#> 8 05 2020-01-02 15.0 7
df2
#> country match_country1 match_country2 match_country3
#> 1 01 02 03 05
#> 2 02 03 04 05
#> 3 03 01 02 04
#> 4 04 01 02 03
#> 5 05 01 03 04
suppressMessages(library(tidyverse))
df1 %>%
left_join(df2, by = 'country') %>%
nest(data = !date) %>%
mutate(data = map(data, ~.x %>%
mutate(across(contains('match'), ~value[match(., country)])) %>%
rowwise() %>%
mutate(avg = mean(c_across(contains('match')), na.rm = T)) %>%
select(!contains('match'))
)
) %>%
unnest(data)
#> # A tibble: 8 x 5
#> date country value blabla avg
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2020-01-01 01 4 23 NaN
#> 2 2020-01-01 04 10 8 4
#> 3 2020-01-02 01 3 41 4.67
#> 4 2020-01-02 02 2 32 6
#> 5 2020-01-02 03 -3 8 2.5
#> 6 2020-01-02 05 15 7 0
#> 7 2020-01-03 03 1.5 50 NaN
#> 8 2020-01-04 03 12 27 NaN
Created on 2021-05-02 by the reprex package (v2.0.0)
Upvotes: 2