Sylvain
Sylvain

Reputation: 143

Determine the weighted mean of different columns in a data frame

I have a data frame Mesure and I wish to determine for each row, a weighted mean like this :

weighted_mean = ((mean_Mesure x nbr_Mesure) + (mean_DL x nbr_DL)) / (nbr_Mesure + nbr_DL)

I know there is a weighted.mean function but I failed to obtain a new column "weighted_mean"

And, is it an issue if each row does not necessary have the 4 values to obtain this formula (such as row 6 in Mesure) ?

> head(Mesure)
         Row.names         mean_Mesure nbr_Mesure  mean_DL    nbr_DL
2    Aquatic_moss.BE-7     123            4         542        12
3   Aquatic_moss.CO-57     100            7         117        14         
4   Aquatic_moss.CO-58     120            5         145        12           
5   Aquatic_moss.CO-60     140            5         153        12 
6  Aquatic_moss.CS-134                              146        15       

Upvotes: 0

Views: 132

Answers (3)

kmacierzanka
kmacierzanka

Reputation: 825

You can use the rowwise() function in the new dplyr:

library(dplyr) # 1.0.0

Mesure %>%
        rowwise() %>%
        mutate(weighted.mean = ((mean_Mesure * nbr_Mesure) + (mean_DL * nbr_DL)) / (nbr_Mesure + nbr_DL))
# A tibble: 5 x 6
# Rowwise: 
  Row.names           mean_Mesure nbr_Mesure mean_DL nbr_DL weighted.mean
  <chr>                     <dbl>      <dbl>   <dbl>  <dbl>         <dbl>
1 Aquatic_moss.BE-7           123          4     542     12          437.
2 Aquatic_moss.CO-57          100          7     117     14          111.
3 Aquatic_moss.CO-58          120          5     145     12          138.
4 Aquatic_moss.CO-60          140          5     153     12          149.
5 Aquatic_moss.CS-134          NA         NA     146     15           NA 

EDIT

If we want to replace NAs with 0, then we can use the na_replace() function from tidyr:

library(dplyr)
library(tidyr) # 1.1.0

Mesure %>%
        replace_na(list(mean_Mesure = 0,
                        nbr_Mesure = 0,
                        mean_DL = 0,
                        nbr_DL = 0)) %>%
        rowwise() %>%
        mutate(weighted.mean = ((mean_Mesure * nbr_Mesure) + (mean_DL * nbr_DL)) / (nbr_Mesure + nbr_DL))
# A tibble: 5 x 6
# Rowwise: 
  Row.names           mean_Mesure nbr_Mesure mean_DL nbr_DL weighted.mean
  <chr>                     <dbl>      <dbl>   <dbl>  <dbl>         <dbl>
1 Aquatic_moss.BE-7           123          4     542     12          437.
2 Aquatic_moss.CO-57          100          7     117     14          111.
3 Aquatic_moss.CO-58          120          5     145     12          138.
4 Aquatic_moss.CO-60          140          5     153     12          149.
5 Aquatic_moss.CS-134           0          0     146     15          146 

DATA

Mesure <- structure(list(Row.names = c("Aquatic_moss.BE-7", "Aquatic_moss.CO-57", 
"Aquatic_moss.CO-58", "Aquatic_moss.CO-60", "Aquatic_moss.CS-134"
), mean_Mesure = c(123, 100, 120, 140, NA), nbr_Mesure = c(4, 
7, 5, 5, NA), mean_DL = c(542, 117, 145, 153, 146), nbr_DL = c(12, 
14, 12, 12, 15)), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"))

Upvotes: 1

slava-kohut
slava-kohut

Reputation: 4233

You can also use mapply. That way you can use a generic function and pass any columns to it:

df <- read.table(text = "
                         Row.names         mean_Mesure nbr_Mesure  mean_DL    nbr_DL
2 Aquatic_moss.BE-7     123            4         542        12
3 Aquatic_moss.CO-57     100            7         117        14         
4 Aquatic_moss.CO-58     120            5         145        12           
5 Aquatic_moss.CO-60     140            5         153        12 
6 Aquatic_moss.CS-134   NA            NA         146        15 ")


df$mean_Mesure[is.na(df$mean_Mesure)] <- 0
df$nbr_Mesure[is.na(df$nbr_Mesure)] <- 0

df$weighted.mean <- mapply(function(x1,x2,x3,x4) (x1*x2 + x3*x4)/(x2+x4), df$mean_Mesure, df$nbr_Mesure,  df$mean_DL, df$nbr_DL)

Output

Row.names mean_Mesure nbr_Mesure mean_DL nbr_DL weighted.mean
2   Aquatic_moss.BE-7         123          4     542     12      437.2500
3  Aquatic_moss.CO-57         100          7     117     14      111.3333
4  Aquatic_moss.CO-58         120          5     145     12      137.6471
5  Aquatic_moss.CO-60         140          5     153     12      149.1765
6 Aquatic_moss.CS-134           0          0     146     15      146.0000

Upvotes: 1

GKi
GKi

Reputation: 39737

In your case you can use your equation like it is to get weighted means per row like:

with(Mesure, ((mean_Mesure * nbr_Mesure) + (mean_DL * nbr_DL)) / (nbr_Mesure + nbr_DL))
#[1] 437.2500 111.3333 137.6471 149.1765       NA

When there are missing values it will return NA. In case NA is 0 you can set it to 0:

Mesure[is.na(Mesure)] <- 0

what gives:

#[1] 437.2500 111.3333 137.6471 149.1765 146.0000

Upvotes: 1

Related Questions