Reputation: 151
I need to calculate a weighted median age by 3 variables (gender, sector and occupation). I know how to use "spatstat" weighted.median. But it gives the weighted age of the entire DF. I don't know hot to make it calculate it by groups/variables. I found another similar post from 2 years ago but i did not understand the solution offered in it. I did understand that you would need to use dplyr to group it. I have very limited understanding of the library. I would need someone to show me the code for the following DF so i can use it and do small changes to it if needed in the future.
My DF is as follows: The original df is much larger and has many more sectors and occupations. But the structure is the same.
Gender Sector occupation Age Weight M A 1 45 13.12 F C 1 23 7.05 M A 3 67 34.12 M A 3 23 11.00 F C 1 33 17.45 F A 1 35 45.01 F C 3 57 21.45 M C 1 32 23.33 F A 3 48 44.54 M A 1 25 13.12 F C 1 27 7.05 M A 3 37 34.12 M A 3 26 11.00 F C 1 73 17.45 F A 1 49 45.01 F C 3 24 21.45 M C 1 25 23.33 F A 3 35 44.54 M A 1 48 13.12 F C 1 83 7.05 M A 3 66 34.12 M A 3 43 11.00 F C 1 36 17.45 F A 1 35 45.01 F C 3 57 21.45 M C 1 22 23.33 F A 3 47 44.54 M A 1 55 13.12 F C 1 28 7.05 M A 3 27 34.12 M A 3 25 11.00 F C 1 23 17.45 F A 1 46 45.01 F C 3 34 21.45 M C 1 27 23.33 F A 3 65 44.54
This is the result I need. NB! The median age in here is fake. It just represents the structure of the end result.
Gender Sector occupation Median age M A 1 23 M A 3 34 M C 1 45 M C 3 23 F A 1 34 F A 3 45 F C 1 43 F C 3 34
The question I found (mentioned above) I did not understand it because i do not understand mutate and the data in the example is presented in a way that i can't read. I'm new to R
Upvotes: 1
Views: 1559
Reputation: 887711
There is a weighted.median
from spatstat
package. We could use that along with tidyverse
approach i.e. grouped by 'Gender', 'Sector' and 'occupation', apply the weighted.median
with x
as 'Age' and w
as 'Weight' in summarise
.
In the linked post, it is using mutate
as it is creating a new column while summarise
returns the summarised output i.e. one row for each group.
In addition, we should be cautious when using summarise/mutate
as those functions are found in plyr
as well and it could mask the dplyr
functions if both packages are loaded i.e. if the masked function from plyr
was used, then instead of grouping, it may return the weighted.median
of the whole data. So, either load only dplyr
or use explicitly dplyr::summarise
library(dplyr)
library(spatstat)
df1 %>%
group_by(Gender, Sector, occupation) %>%
dplyr::summarise(MedianAge = weighted.median(Age, Weight),
.groups = 'drop')
-output
# A tibble: 7 x 4
Gender Sector occupation MedianAge
<chr> <chr> <int> <dbl>
1 F A 1 35
2 F A 3 47
3 F C 1 31.0
4 F C 3 34
5 M A 1 45
6 M A 3 33.8
7 M C 1 25
df1 <- structure(list(Gender = c("M", "F", "M", "M", "F", "F", "F",
"M", "F", "M", "F", "M", "M", "F", "F", "F", "M", "F", "M", "F",
"M", "M", "F", "F", "F", "M", "F", "M", "F", "M", "M", "F", "F",
"F", "M", "F"), Sector = c("A", "C", "A", "A", "C", "A", "C",
"C", "A", "A", "C", "A", "A", "C", "A", "C", "C", "A", "A", "C",
"A", "A", "C", "A", "C", "C", "A", "A", "C", "A", "A", "C", "A",
"C", "C", "A"), occupation = c(1L, 1L, 3L, 3L, 1L, 1L, 3L, 1L,
3L, 1L, 1L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 1L, 1L, 3L, 3L, 1L, 1L,
3L, 1L, 3L, 1L, 1L, 3L, 3L, 1L, 1L, 3L, 1L, 3L), Age = c(45L,
23L, 67L, 23L, 33L, 35L, 57L, 32L, 48L, 25L, 27L, 37L, 26L, 73L,
49L, 24L, 25L, 35L, 48L, 83L, 66L, 43L, 36L, 35L, 57L, 22L, 47L,
55L, 28L, 27L, 25L, 23L, 46L, 34L, 27L, 65L), Weight = c(13.12,
7.05, 34.12, 11, 17.45, 45.01, 21.45, 23.33, 44.54, 13.12, 7.05,
34.12, 11, 17.45, 45.01, 21.45, 23.33, 44.54, 13.12, 7.05, 34.12,
11, 17.45, 45.01, 21.45, 23.33, 44.54, 13.12, 7.05, 34.12, 11,
17.45, 45.01, 21.45, 23.33, 44.54)), class = "data.frame", row.names = c(NA,
-36L))
Upvotes: 4