vojtam
vojtam

Reputation: 1245

Make an average based on condition from second df in R

I have a following problme. I have two dataframes. In the second one. there are conditions about how a new column in the first dataframe should be calculated. See example bellow: First df:

df1 <- data.frame(country = c("01", "01", "02", "03", "03", "03" , "04", "05"),
                  date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
                  value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
                  blabla = c(23, 41, 32, 8, 50, 27, 8, 7)
)

Second df:

df2 <- data.frame(       country = c("01",  "02", "03", "04", "05" ),
                  match_country1 = c("02",  "03", "01", "01", "01"), 
                  match_country2 = c("03",  "04", "02", "02", "03"), 
                  match_country3 = c("05",  "05", "04", "03", "04")
)

Now I need to compute a new_value that is an average of three values as defined in df2. I need to respect a date in df1. For example, new_value for country "01" and date "2020-01-01" is an average of a value from country "02", country "03", country "05" all from date "2020-01-01".

Desired output is below:

new_df <- data.frame(country = c("01", "01", "02", "03", "03", "03" , "04", "05"),
                  date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
                  value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
                  blabla = c(23, 41, 32, 8, 50, 27, 8, 7),
                  new_value = c(NA, #because no data for 02, 03, 05 from 2020-01-01
                                (2-3+15)/3,
                                (-3+15)/2, #because no data for 04 from 2020-01-02
                                (3+2)/2, #because no data for 04 from 2020-01-02
                                NA,  #because no data for 01, 02, 04 from 2020-01-03
                                NA,  #because no data for 01, 02, 04 from 2020-01-04
                                4, #because no data for 02, 03 from 2020-01-01
                                (3-3)/2 #because no data for 04 from 2020-01-02
  )
)

How can I do this, please?

Upvotes: 1

Views: 76

Answers (3)

G. Grothendieck
G. Grothendieck

Reputation: 269754

This can be done using an SQL triple join. For each row in df1 get the matching country row in df2 via left join and then get all the rows in the b instance of df1 for which the date is the same and there is a country match in df2. Then take the average b value in the matching rows.

library(sqldf)
sqldf("select a.*, avg(b.value) new_value
  from df1 a
  left join df2 c on a.country = c.country
  left join df1 b on a.date = b.date and 
    b.country in (c.match_country1, c.match_country2, c.match_country3)
  group by a.rowid")

giving this data frame:

  country       date value blabla new_value
1      01 2020-01-01   4.0     23        NA
2      01 2020-01-02   3.0     41  4.666667
3      02 2020-01-02   2.0     32  6.000000
4      03 2020-01-02  -3.0      8  2.500000
5      03 2020-01-03   1.5     50        NA
6      03 2020-01-04  12.0     27        NA
7      04 2020-01-01  10.0      8  4.000000
8      05 2020-01-02  15.0      7  0.000000

Variations

Here are two variations. The first generates the in (...) string as matches and substitutes it in and the second converts df2 to long form, long first.

matches <- toString(names(df2)[-1])
fn$sqldf("select a.*, avg(b.value) new_value
  from df1 a
  left join df2 c on a.country = c.country
  left join df1 b on a.date = b.date and b.country in ($matches)
  group by a.rowid")


varying <- list(match_country = names(df2)[-1])
long <- reshape(df2, dir = "long", varying = varying, v.names = names(varying))
sqldf("select a.*, avg(b.value) new_value
  from df1 a
  left join long c on a.country = c.country
  left join df1 b on a.date = b.date and b.country = c.match_country
  group by a.rowid")

Upvotes: 2

Rui Barradas
Rui Barradas

Reputation: 76470

Though there already is an accepted answer, here is a base R, since the two answers posted (2nd) require external packages.

df1$new_value <- with(df1, ave(seq_len(n), date, FUN = function(i){
  mrg <- merge(df1[i, ], df2)
  j <- grep("^match", names(mrg))
  ctry <- unique(df1[i, "country"])
  apply(mrg[j], 1, function(row){
    k <- match(row, ctry)
    if(any(!is.na(k)))
      mean(mrg[k, "value"], na.rm = TRUE)
    else NA_real_
  })
}))

identical(df1$new_value, new_df$new_value)
#[1] TRUE

Upvotes: 1

AnilGoyal
AnilGoyal

Reputation: 26218

This tidyverse approach may help

df1
#>   country       date value blabla
#> 1      01 2020-01-01   4.0     23
#> 2      01 2020-01-02   3.0     41
#> 3      02 2020-01-02   2.0     32
#> 4      03 2020-01-02  -3.0      8
#> 5      03 2020-01-03   1.5     50
#> 6      03 2020-01-04  12.0     27
#> 7      04 2020-01-01  10.0      8
#> 8      05 2020-01-02  15.0      7

df2
#>   country match_country1 match_country2 match_country3
#> 1      01             02             03             05
#> 2      02             03             04             05
#> 3      03             01             02             04
#> 4      04             01             02             03
#> 5      05             01             03             04

suppressMessages(library(tidyverse))

df1 %>% 
  left_join(df2, by = 'country') %>% 
  nest(data = !date) %>%
  mutate(data = map(data, ~.x %>%
                      mutate(across(contains('match'), ~value[match(., country)])) %>%
                      rowwise() %>%
                      mutate(avg = mean(c_across(contains('match')), na.rm = T)) %>%
                      select(!contains('match'))
                    )
         ) %>%
  unnest(data)
#> # A tibble: 8 x 5
#>   date       country value blabla    avg
#>   <chr>      <chr>   <dbl>  <dbl>  <dbl>
#> 1 2020-01-01 01        4       23 NaN   
#> 2 2020-01-01 04       10        8   4   
#> 3 2020-01-02 01        3       41   4.67
#> 4 2020-01-02 02        2       32   6   
#> 5 2020-01-02 03       -3        8   2.5 
#> 6 2020-01-02 05       15        7   0   
#> 7 2020-01-03 03        1.5     50 NaN   
#> 8 2020-01-04 03       12       27 NaN

Created on 2021-05-02 by the reprex package (v2.0.0)

Upvotes: 2

Related Questions