Reputation: 1440
For some years now I've been using the Hmisc package and base R to compute weighted statistical summaries. Typically I use a double weight, one component being a spatial influence weight and the other a data support value such as length, volume, physical density and so on. Using the mtcars dataset as an example, where mpg is the variable of interest and the double weight is contrived from car wt and hp, the Hmisc + base R workflow typically looks like the one below.
require(Hmisc)
mtcars$Wt2 <- mtcars$wt * mtcars$hp          # double weight
mtcars$Acc <- mtcars$Wt2 * mtcars$mpg        # accumulation
min(mtcars$mpg)                              # min
sqrt(wtd.var(mtcars$mpg, mtcars$Wt2))        # weighted SD
wtd.quantile(mtcars$mpg, mtcars$Wt2, 0.05)   # weighted 5th percentile
wtd.quantile(mtcars$mpg, mtcars$Wt2, 0.50)   # weighted median
wtd.quantile(mtcars$mpg, mtcars$Wt2, 0.95)   # weighted 95th percentile
max(mtcars$mpg)                              # max
Using a loop it is then possible to filter and compute these weighted statistics for each zone of interest in a larger data frame. However, as I'm attempting to learn how to use dplyr, I'm wondering how to compute these weighted statistics with it. A built-in weighted.mean() function is available, but the others will require a bit more work. Below is the code where I've worked out how to compute the weighted mean from first principles (and checked this against the built-in function). However, I'm stumped as to how to compute the weighted SD and quantiles by group using dplyr, as I somehow need to get the squared weighted-mean differences (for each group) into the chain of pipes.
library(dplyr)

mtcars %>%
  mutate(Car = row.names(mtcars)) %>%   # variable for car name
  mutate(Wt2 = wt * hp) %>%             # double weight
  mutate(Acc = Wt2 * mpg) %>%           # accumulation (weighted consumption)
  group_by(Car) %>%                     # group by car type
  summarise(n = n(),
            SmWt2 = sum(Wt2),                  # sum of double weights
            SmAcc = sum(Acc),                  # sum of accumulations
            WtMn = SmAcc / SmWt2,              # weighted mean
            WtMnChk = weighted.mean(mpg, Wt2)  # check weighted mean
  )
Upvotes: 4
Views: 2962
Reputation: 66490
I'm not sure I understand exactly the approach you're working on, but here's an example of finding the weighted average and weighted standard deviation by gear, using wt as the weighting:
library(dplyr)

datasets::mtcars %>%
  group_by(gear) %>%
  summarize(n = n(),
            mpg_weighted_by_weight = sum(mpg * wt) / sum(wt),
            mpg_weighted_by_weight_check = weighted.mean(mpg, wt),
            mpg_sd = sqrt(sum(wt * ((mpg - mpg_weighted_by_weight)^2)) / (sum(wt) - 1)),
            mpg_sd_check = sqrt(Hmisc::wtd.var(mpg, wt)))
# A tibble: 3 x 6
gear n mpg_weighted_by_weight mpg_weighted_by_weight_check mpg_sd mpg_sd_check
* <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 3 15 15.6 15.6 3.32 3.32
2 4 12 23.6 23.6 4.81 4.81
3 5 5 19.7 19.7 5.63 5.63
I wasn't familiar with the formula for weighted standard deviation, so I cheated and relied on the formula from Hmisc::wtd.var. If you control-click on the function name in RStudio, it shows the underlying code of the function. Most of it is error handling until the bottom:
# Hmisc::wtd.var
function (x, weights = NULL, normwt = FALSE, na.rm = TRUE,
          method = c("unbiased", "ML"))
{
# ... skipping error handling
sw <- sum(weights)
# ...
xbar <- sum(weights * x)/sw
sum(weights * ((x - xbar)^2))/(sw - 1)
}
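For the weighted quantiles in your original workflow, Hmisc::wtd.quantile can be dropped into summarize() in the same way, since with a single probability it returns a length-one value. Here is a minimal sketch along the same lines (again grouping by gear and weighting by wt; the column names are just illustrative):

library(dplyr)

# Weighted 5th percentile, median and 95th percentile of mpg by gear,
# using wt as the weight, mirroring the wtd.quantile calls in the question.
datasets::mtcars %>%
  group_by(gear) %>%
  summarize(n = n(),
            mpg_wtd_p05    = Hmisc::wtd.quantile(mpg, wt, probs = 0.05),
            mpg_wtd_median = Hmisc::wtd.quantile(mpg, wt, probs = 0.50),
            mpg_wtd_p95    = Hmisc::wtd.quantile(mpg, wt, probs = 0.95))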
Upvotes: 3