Reputation: 1440
For some years now I've been using the Hmisc package and base R to compute weighted statistical summaries. Typically I use a double weight, one component being a spatial influence weight and the other a data support value such as length, volume, physical density and so on. Using the mtcars dataset as an example, where mpg is the variable of interest and the double weight is contrived from car wt and hp, the Hmisc + base R workflow typically looks like the one below.
require(Hmisc)
mtcars$Wt2 <- mtcars$wt * mtcars$hp          # double weight
mtcars$Acc <- mtcars$Wt2 * mtcars$mpg        # accumulation
min(mtcars$mpg)                              # min
sqrt(wtd.var(mtcars$mpg, mtcars$Wt2))        # weighted SD
wtd.quantile(mtcars$mpg, mtcars$Wt2, 0.05)   # weighted 5th percentile
wtd.quantile(mtcars$mpg, mtcars$Wt2, 0.50)   # weighted median
wtd.quantile(mtcars$mpg, mtcars$Wt2, 0.95)   # weighted 95th percentile
max(mtcars$mpg)                              # max
Using a loop it is then possible to filter and compute these weighted statistics for each zone of interest in a larger data frame. However, as I'm attempting to learn how to use dplyr, I'm wondering how to compute these weighted statistics with it. A built-in weighted.mean() function is available, but the others will require a bit more work. Below is the code where I've worked out how to compute the weighted mean from first principles (and checked this against the built-in function). However, I'm stumped as to how to compute the weighted SD and quantiles by group using dplyr, as I somehow need to get the squared weighted-mean differences (for each group) into the chain of pipes.
library(dplyr)

mtcars %>%
  mutate(Car = row.names(mtcars)) %>%   # variable for car name
  mutate(Wt2 = wt * hp) %>%             # double weight
  mutate(Acc = Wt2 * mpg) %>%           # accumulation (weighted consumption)
  group_by(Car) %>%                     # group by car type
  summarise(n = n(),
            SmWt2 = sum(Wt2),                  # sum of double weights
            SmAcc = sum(Acc),                  # sum of accumulations
            WtMn = SmAcc / SmWt2,              # weighted mean
            WtMnChk = weighted.mean(mpg, Wt2)  # check weighted mean
  )
Upvotes: 4
Views: 2962
Reputation: 66490
I'm not sure I understand exactly the approach you're working on, but here's an example of finding the weighted average and weighted standard deviation by gear, using wt as the weighting:
library(dplyr)

datasets::mtcars %>%
  group_by(gear) %>%
  summarize(n = n(),
            mpg_weighted_by_weight = sum(mpg * wt) / sum(wt),
            mpg_weighted_by_weight_check = weighted.mean(mpg, wt),
            mpg_sd = sqrt(sum(wt * ((mpg - mpg_weighted_by_weight)^2)) / (sum(wt) - 1)),
            mpg_sd_check = sqrt(Hmisc::wtd.var(mpg, wt)))
# A tibble: 3 x 6
gear n mpg_weighted_by_weight mpg_weighted_by_weight_check mpg_sd mpg_sd_check
* <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 3 15 15.6 15.6 3.32 3.32
2 4 12 23.6 23.6 4.81 4.81
3 5 5 19.7 19.7 5.63 5.63
I wasn't familiar with the formula for weighted standard deviation, so I cheated and relied on the formula from Hmisc::wtd.var. If you control-click on the function name in RStudio, it shows the underlying code of the function. Most of it is error handling until the bottom:
# Hmisc::wtd.var
function (x, weights = NULL, normwt = FALSE, na.rm = TRUE,
          method = c("unbiased", "ML"))
{
# ... skipping error handling
sw <- sum(weights)
# ...
xbar <- sum(weights * x)/sw
sum(weights * ((x - xbar)^2))/(sw - 1)
}
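For the weighted quantiles in your original workflow, Hmisc::wtd.quantile can be dropped into summarize() in the same way, since with a single probability it returns a length-one value. Here is a minimal sketch along the same lines (again grouping by gear and weighting by wt; the column names are just illustrative):

library(dplyr)

# Weighted 5th percentile, median and 95th percentile of mpg by gear,
# using wt as the weight, mirroring the wtd.quantile calls in the question.
datasets::mtcars %>%
  group_by(gear) %>%
  summarize(n = n(),
            mpg_wtd_p05    = Hmisc::wtd.quantile(mpg, wt, probs = 0.05),
            mpg_wtd_median = Hmisc::wtd.quantile(mpg, wt, probs = 0.50),
            mpg_wtd_p95    = Hmisc::wtd.quantile(mpg, wt, probs = 0.95))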
Upvotes: 3