Nick

Reputation: 145

weighted.mean inside aggregate across 2 vectors in R?

I have a data frame with columns Latitude, Longitude, Period, and ID. I would like to calculate the positional centroid for each period (n = 2), weighted by the number of observations for each unique ID, so that observations from IDs with fewer observations receive higher weights than those from IDs with more observations.

My data frame has 300,000 observations but looks something like this:

dat <- data.frame(Latitude = c(35.8, 35.85, 36.7, 35.2, 36.1, 35.859, 36.0, 37.0, 35.1, 35.2),
                  Longitude = c(-89.4, -89.5, -89.4, -89.8, -90, -89.63, -89.7, -89, -88.9, -89),
                  Period = c("early", "early", "early", "early", "early", "late", "late", "late", "late", "late"),
                  ID = c("A", "A", "A", "B", "C", "C", "C", "D", "E", "E"))

I can easily calculate the unweighted mean for the early and late periods using aggregate:

centroid <- aggregate(cbind(Longitude, Latitude) ~ Period, dat, mean)

But is there a way to calculate the centroid for each period weighted by the number of observations for each ID, so that IDs with more observations do not bias the mean? If possible, I would like an elegant way of doing this inside the aggregate function; a dplyr solution would also be helpful.

Any assistance would be much appreciated. Best,

Nick

Upvotes: 1

Views: 357

Answers (1)

TimTeaFan

Reputation: 18561

If you want to calculate your own weights based on the Period and ID groups, so that each ID has the same influence on the centroid of its Period, you just need to divide 1 by the number of observations in each Period/ID group. The code below uses weighted.mean inside dplyr::across.
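As a short sketch of why 1/n weights give every ID equal influence (the notation is introduced here for illustration and is not part of the original answer): let period p contain K_p distinct IDs, let ID i have n_{pi} observations x_{pij} in that period, and let \bar{x}_{pi} be that ID's own mean. The weighted mean then collapses to the plain average of the per-ID means:

```latex
\bar{x}^{\,w}_{p}
  = \frac{\sum_{i=1}^{K_p} \sum_{j=1}^{n_{pi}} \frac{1}{n_{pi}}\, x_{pij}}
         {\sum_{i=1}^{K_p} \sum_{j=1}^{n_{pi}} \frac{1}{n_{pi}}}
  = \frac{\sum_{i=1}^{K_p} \bar{x}_{pi}}{K_p}
```

For the early period's Longitude in the sample data, the per-ID means are about -89.433 (A), -89.8 (B) and -90 (C), and their average is about -89.74.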

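library(dplyr)

# `dat` is defined in the "# data" section further down
dat %>%
  # weight each row by 1 / (number of rows for its ID within the Period)
  group_by(Period, ID) %>%
  mutate(weight = 1/n()) %>%
  # regroup by Period only and take the weighted mean of both coordinates
  group_by(Period) %>%
  summarise(across(c(Longitude, Latitude),
                   ~ weighted.mean(.x, w = weight)))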

#> # A tibble: 2 x 3
#>   Period Longitude Latitude
#>   <chr>      <dbl>    <dbl>
#> 1 early      -89.7     35.8
#> 2 late       -89.2     36.0

# data
dat <- data.frame(Latitude = c(35.8, 35.85, 36.7, 35.2, 36.1, 35.859, 36.0, 37.0, 35.1, 35.2),
                  Longitude = c(-89.4, -89.5, -89.4, -89.8, -90, -89.63, -89.7, -89, -88.9, -89),
                  Period = rep(c("early", "late"), each = 5),
                  ID = c("A", "A", "A", "B", "C", "C", "C", "D", "E", "E"))

Created on 2021-08-26 by the reprex package (v0.3.0)
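Since the question also asked about aggregate, here is a minimal base R sketch. It is not part of the original answer; it relies on the fact that weighting each row by 1/n within a Period/ID group gives the same result as averaging the per-ID mean positions. The intermediate name per_id is made up for illustration.

```r
# Same toy data as in the answer above
dat <- data.frame(
  Latitude  = c(35.8, 35.85, 36.7, 35.2, 36.1, 35.859, 36.0, 37.0, 35.1, 35.2),
  Longitude = c(-89.4, -89.5, -89.4, -89.8, -90, -89.63, -89.7, -89, -88.9, -89),
  Period    = rep(c("early", "late"), each = 5),
  ID        = c("A", "A", "A", "B", "C", "C", "C", "D", "E", "E")
)

# Step 1: one mean position per Period/ID combination
# (per_id is a hypothetical helper name)
per_id <- aggregate(cbind(Longitude, Latitude) ~ Period + ID,
                    data = dat, FUN = mean)

# Step 2: unweighted mean of those per-ID positions within each Period,
# so every ID counts once regardless of how many rows it has
centroid <- aggregate(cbind(Longitude, Latitude) ~ Period,
                      data = per_id, FUN = mean)
centroid
```

On the sample data this should agree with the dplyr result above. Whether two aggregate calls are fast enough on 300,000 rows is something to check on the real data.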

Upvotes: 1
