Reputation: 405
I'm struggling with a dataset where I have to identify certain traits from each family group. The following is a toy dataset. In the dataset, I can identify the person ID, family ID, which person is the head household for each family (HH), the age, income, and if the person has a disability or not:
Person Family HH AGE Income Disable
1 1 1 30 1000 0
2 1 0 28 1990 0
3 1 0 5 2000 0
4 2 1 20 2300 0
5 3 0 45 0 1
6 3 1 51 3200 0
7 3 0 23 1000 0
8 4 0 22 0 1
9 4 1 22 1200 0
10 5 1 65 2220 0
The objective is to create something like this:
Family Age_HH Family_income_percapita Disable
1 30 1663.3 0
2 20 2300 0
3 51 1400 1
4 22 600 1
5 65 2220 0
My try was to do something like this:
dt %>%
group_by(Family) %>%
summarise(
Age_HH = case_when(HH == 1 ~ AGE, TRUE ~ 0), # Head household age
Family_income_percapita = sum(Income)/n(), # Income per-capita
Disable = max(Disable) # If there is any family member with a disability
)
The original dataset is huge, so I don't know how to check if my method is right by hand. I wonder if there is a better way to compute Age_HH
. Thank you!!
Upvotes: 2
Views: 94
Reputation: 184
It looks like the problem with your code is that you are trying to do everything inside of summarise in an aggregated fashion, when some of the calculations actually could be performed with a mutate, either before or after being aggregated. The following code seems to work with your toy data. Let me know if that helps.
library(tidyverse)
dt <- tibble::tribble(
~person, ~family, ~hh, ~age, ~income, ~disable,
1, 1, 1, 30, 1000, 0,
2, 1, 0, 28, 1990, 0,
3, 1, 0, 5, 2000, 0,
4, 2, 1, 20, 2300, 0,
5, 3, 0, 45, 0, 1,
6, 3, 1, 51, 3200, 0,
7, 3, 0, 23, 1000, 0,
8, 4, 0, 22, 0, 1,
9, 4, 1, 22, 1200, 0,
0, 5, 1, 65, 2220, 0
)
dt %>%
mutate(age_hh = if_else(hh == 1, age, NaN)) %>%
group_by(family) %>%
summarise(
age_hh = mean(age_hh, na.rm = TRUE),
family_income = sum(income, na.rm = TRUE),
family_count = n(),
disable = max(disable, na.rm = TRUE)
) %>%
transmute(
family,
age_hh,
family_income_per_capita = family_income/family_count,
disable
)
# # A tibble: 5 x 4
# family age_hh family_income_per_capita disable
# <dbl> <dbl> <dbl> <dbl>
# 1 1 30 1663. 0
# 2 2 20 2300 0
# 3 3 51 1400 1
# 4 4 22 600 1
# 5 5 65 2220 0
Upvotes: 1
Reputation: 10776
case_when(HH == 1 ~ AGE, TRUE ~ 0)
doesn't aggregate anything yet.
You could aggregate this result though.
You can more succinctly do this operation by subscripting within groups.
If you know you have exactly one HH per family you can do something like Age_HH = Age[HH == 1]
.
If you have no head the family won't appear, if you have multiple it'll appear multiple times so if you have 0 or more than one you can aggregate the subscripted fields more safely like Age_HH = max(Age[HH == 1])
.
Upvotes: 2