Reputation: 131
I have a dataset with approximately 1.5 million individual who can be distinguished in their household with “Household ID”. The dataset has a column (relationship) that specifies the relationship between people in the household based on the relationship they have with the head of household. “code 1” refers to the head of the household; “code 2” is the wife of head; “code 3” is the children of head; “code 4” is daughter in low and son in low (if the gender code is 1 it refers to son in low and 2 refers to daughter in law); “code 5” refers to grandchildren of the head. “Code 6” refers to the mother or father of the head based on their gender; “Code 7” is a brother or sister of head, here again, based on their gender; “Code 8” refers to other relatives, and “Code 9” refers to others in the household which are not considered as relative (nonrelative). Following I bring a small sample of my dataset:
```library(tidyverse)
sample <- tibble(
household.ID = c(12850540085,12850540085,12850540085,12850540085,11854478630,11854478630,11854478630,11854478630,11854478630,243851633,243851633,243851633,243851633,243851633,243851633 ,243851633,243851633,243851633,243851633,16523634,16523634,16523634,16523634,16525912,16525912,16540127,16540127,16598050,16598050,16611764,16611764,16611764,16643309,16643309,16652356,16652356,16652356,16672105,16672105,16672105,16672105, 14347606936,14347606936,14513544764,14513544764,14513544764,14513544764,14513544764 ),
member.ID= c(12850560156, 12850560381, 12850560592,12850560795,23529759,23529760,23529761,23529762 ,23529763,243936326,243946420 ,243953196 ,243963749,243969681 ,243976774,243982427,243988959 ,243995522 ,244060815 ,16527193, 16529443, 16532250, 16534992,16527527, 16529230,16542499,16545263, 16616975, 16620223, 16633984,16642611,16650837,16646986,16650210, 16660335, 16665128,16668381,16676674, 16681528, 16685073,16687491, 14347619183,14347619282,14513560002,14513560086,14513560144,14513560214,14513560291),
relationship = c(1,2,3,5,1,9,3,6,9,1,2,3,3,3,6,9,7,8,8,1,2,3,3,1,2,1,2,1,3,1,2,3,1,2,1,2,3,1,2,3,3,1,7,1,2,4,5,5),
gender = c(1,2,2,1,1,2 ,2 ,2,1,1,2,2 ,2,2,1,2,1,2,2,1,2,1,2,1,2,1,2,1,1,1,2,1,1,2,1,2,2,1,2,1,1,1,1,1,2,2,1,1),
age = c(64,58,38,6,42,42,1,76,19,61,59,35,33,29,96,28,45,43,15,49,47,19,18,38,28,78,75,66,21,56,52,28,51,58,74,68,27,56,43,23,13,45,42, 65 ,58 ,29,4 ,0),
marriage.status= c(1,1,1,4,1,1,4,2,1,1,1,4,1,1,2,4,1,1,4,1,1,4,4,1,1,1,1,2,4,1,1,1,1,1,1,1,4,1,1,4,4,4,4,1,1,2,4,4),
children.ever.born= c(NA,7,3,NA,NA,2,NA,9,NA,NA,3,NA,NA,NA,NA,NA,NA,1,NA ,NA,2,NA,NA,NA,NA,NA,6,NA,NA,NA,2,NA,NA,3,NA,3,NA,NA,2,NA,NA,NA,NA,NA,8,2,NA,NA),
living.children = c(NA,5,3,NA,NA,2,NA,6,NA,NA,3,NA,NA,NA,NA,NA,NA,1,NA,NA,2,NA,NA,NA,NA,NA,4,NA,NA,NA,2,NA,NA,3,NA,3,NA,NA,2,NA,NA,NA,NA,NA,4,2,NA,NA ))```
Although it seems cruel, I want to link the abovementioned codes to link children with their mothers. For example, people in “code 2” are the mothers of “code 3”. Or if the head of household is a woman (gender 2), again the “code 3” is their children, and if “code 3” with gender (2) is the mothers of “code 4”, but there is an exception here. If you look at the data, in some households we can see that there is no “code 4” in the household, but we have “code 5” in the household. Here, those grandchildren are related to the “code 3” who have children and live in their parents’ household. (Table below)
“Code 6” with gender (2) are the mothers of “code 7” and the head of the household. There is a challenge here about other relatives with “code 8” and nonrelatives with “code 9”. Mothers specify with these codes(8 and 9), do not have code to refer to their children, but when there is a person within the household with “code 8 and 9” and she is married and has a child based on column “living. children”, any other child with similar codes in the household, is the child of the women based on proximity in the household. To illustrate, consider the following table which is one of the households in the dataset with a similar household. ID.
Our procedure would be to keep looping through each household looking for those relationship/gender pairings that I described, starting with the unambiguous ones (like child-mother) and ending with the ones for which there could be more than one potential mother (e.g., grandchild). For instance, row 1, based on the relationship column, is “code 1: which is the head of household and is a man. "code 2" is a woman and the spouse of the head and based on the “children ever born” column, she has had 3 children so far, and all of them are present in the household, which is not always the case. Code 3 is the child of code 2. Code 6 is the father of the head based on his gender (1). Code 9 is nonrelative which is alone in the household. Code 7 is the brother of the head based on his gender. Code 8 here needs more explanation. We have two individuals with code 8 within the household, both of them are female, and the first one has one child based on the “living children” column; therefore, we can conclude that the second individual with code 8 is the daughter of the first one. This linkage is logical based on the mother’s age. Finally, I want an output like the following table with a new column that is filled with the member.ID of the individuals’ mothers if existed.
Upvotes: 0
Views: 51
Reputation: 66880
Here's a suggestion for a generalized approach. Define the links you want to detect, put those in tables, and then join your data to those tables. For instance, we can make a table with the 2's who match with the 3's in each household:
sample2 <- select(sample, 1:3) # we only need the first 3 columns
mother_links <- left_join(
sample2 %>% filter(relationship == 3),
sample2 %>% filter(relationship == 2) %>%
rename(member.ID.mother = member.ID), join_by(household.ID)) %>%
filter(!is.na(relationship.y)) %>%
select(1:2, member.ID.mother)
So here we have the links for those people:
> mother_links
# A tibble: 2 × 3
household.ID member.ID member.ID.mother
<dbl> <dbl> <dbl>
1 243851633 243946420 243995522
2 243851633 243946420 244060815
As for the 2nd link that's described, looking at the 8's, linking the women with children to the ones without children:
sample3 <- select(sample, 1:3, gender, age, living.children)
other_links <- left_join(
sample3%>% filter(relationship == 8, is.na(living.children)),
sample3 %>%
filter(relationship == 8, gender == 2, living.children > 0) %>%
rename(member.ID.mother = member.ID),
join_by(household.ID)
) %>%
select(1:2, member.ID.mother)
> other_links
# A tibble: 1 × 3
household.ID member.ID member.ID.mother
<dbl> <dbl> <dbl>
1 243851633 244060815 243995522
Then we can join them to the original data to see results like in your example:
sample %>%
left_join(bind_rows(mother_links, other_links)) %>%
filter(household.ID == 243851633)
Joining with `by = join_by(household.ID, member.ID)`
# A tibble: 10 × 9
household.ID member.ID relationship gender age marriage.status children.ever.born living.children member.ID.mother
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 243851633 243936326 1 1 61 1 NA NA NA
2 243851633 243946420 2 2 59 1 3 3 NA
3 243851633 243953196 3 2 35 4 NA NA 243946420
4 243851633 243963749 3 2 33 1 NA NA 243946420
5 243851633 243969681 3 2 29 1 NA NA 243946420
6 243851633 243976774 6 1 96 2 NA NA NA
7 243851633 243982427 9 2 28 4 NA NA NA
8 243851633 243988959 7 1 45 1 NA NA NA
9 243851633 243995522 8 2 43 1 1 1 NA
10 243851633 244060815 8 2 15 4 NA NA 243995522
Upvotes: 2