X.Jun
X.Jun

Reputation: 390

How to calculate the number of children for each female in a dataset in R?

I'm working with a dataset in R and I'm trying to calculate the number of children for each female individual based on their relationship to the household head. The dataset includes variables such as Household ID, Individual ID, Relationship to the household head, Age, Gender, and Income.

 HouseholdID IndividualID Relationshiptothehouseholdhead  Age Gender  Income
    <dbl>  <dbl> <chr>        <dbl> <chr> <dbl>
 1      1      1 C               80 male      150
 2      1      2 D               81 female      120
 3      1      3 A               60 male      630
 4      1      4 B               59 female      500
 5      1      5 E3              35 male      380
 6      1      6 F3              30 female      220
 7      1      7 E5              33 female      170
 8      1      8 F5              30 male      160
 9      1      9 G32             20 female      290
10      1     10 G51             15 female      200
11      1     11 G52             12 female      100
12      1     12 G55              8 male       80
13      2      1 A               58 male      380
14      2      2 B               55 female      220
15      2      3 E1              35 male      170
16      2      4 F1              37 female      160
17      2      5 E2              33 male      290
18      2      6 F2              30 female      110
19      2      7 G21             17 female      210
20      2      8 G22             15 female      750
21      2      9 G23             12 female      350

The data structure provided in the table includes the following variables:

Household ID: This is a unique identifier for a family household.
Individual ID: This is a unique number assigned to each individual within the household.
Relationship to the household head: Specific symbols are used to represent the relationship of an individual to the head of the household.

Age: The age of the individual.
Gender: The gender of the individual, represented by "male" or "female".
Income: The income situation of the individual.

Please generate a dataset similar to Table 2 based on the data from Table 1 with the following requirements:

  1. Only include female individuals.
  2. Calculate the number of children each female has given birth to.

It's important to note that the number of children is primarily determined by the highest number following the letter, rather than simply counting the number of observations in the data. For example, in household 1, the individual with ID equal to 4 should be considered as having 5 children, not 2.

The result should be as follows:

HouseholdID IndividualID  Age Gender  Income  Numofkids
1 2 81  female 120 1
1 4 59  female 500 5
1 6 30  female 220 2
1 7 33  female 170 3
1 9 35  female 290 0
1 10  15  female 200 0
1 11  12  female 100 0
2 2 55  female 220 2
2 4 37  female 160 0
2 6 30  female 110 3
2 7 17  female 210 0
2 8 15  female 750 0
2 9 12  female 350 0

Here is the data

data = structure(list(HouseholdID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2), IndividualID = c(1, 2, 3, 4, 
5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9), Relationshiptothehouseholdhead = c("C", 
"D", "A", "B", "E3", "F3", "E5", "F5", "G32", "G51", "G52", "G55", 
"A", "B", "E1", "F1", "E2", "F2", "G21", "G22", "G23"), Age = c(80, 
81, 60, 59, 35, 30, 33, 30, 20, 15, 12, 8, 58, 55, 35, 37, 33, 
30, 17, 15, 12), Gender = c("male", "female", "male", "female", 
"male", "female", "female", "male", "female", "female", "female", 
"male", "male", "female", "male", "female", "male", "female", 
"female", "female", "female"), Income = c(150, 120, 630, 500, 
380, 220, 170, 160, 290, 200, 100, 80, 380, 220, 170, 160, 290, 
110, 210, 750, 350)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -21L))

Thanks!

Upvotes: 1

Views: 169

Answers (2)

X.Jun
X.Jun

Reputation: 390

I have translated the aforementioned code into the format of the dplyr package for future reference.

library(dplyr)
data <- data %>%
  mutate(hh_rel = substr(Relationshiptothehouseholdhead, 1, 1),
         gen2_idx = as.integer(substr(Relationshiptothehouseholdhead, 2, 2)),
         gen3_idx = as.integer(substr(Relationshiptothehouseholdhead, 3, 3)))

gen2 <- data %>%
  group_by(HouseholdID) %>%
  summarise(gen2_max = max(gen2_idx, na.rm = TRUE))

gen3 <- data %>%
  filter(!is.na(gen2_idx)) %>%
  group_by(HouseholdID, gen2_idx) %>%
  summarise(gen3_max = max(gen3_idx, na.rm = TRUE))

out <- data %>%
  filter(Gender == "female") %>%
  left_join(gen2, by = "HouseholdID") %>%
  left_join(gen3, by = c("HouseholdID", "gen2_idx")) %>%
  mutate(Numofkids = case_when(
    hh_rel == "D" ~ 1L,
    hh_rel == "A" | hh_rel == "B" ~ coalesce(gen2_max, 0L),
    hh_rel == "E" | hh_rel == "F" ~ coalesce(gen3_max, 0L),
    TRUE ~ 0L
  )) %>%
  select(-hh_rel, -gen2_idx, -gen3_idx)

Upvotes: 0

Tobo
Tobo

Reputation: 754

From a coding perspective this question is clear, even if the exercise itself involves simplifications and assumptions. However, I've assumed your comment about using the numbering rather than the counts extends to the household head's children's children, and so I get a different answer from you for Household 1, Individual 7. They are E5, and mother of G51, G52, G55, which implies 5 children not 3.

library(data.table)
dt <- as.data.table(data)

# split the household relationship column into 3 for ease
dt[, let(hh_rel   = substr(Relationshiptothehouseholdhead,1,1),
         gen2_idx = as.integer(substr(Relationshiptothehouseholdhead,2,2)),
         gen3_idx = as.integer(substr(Relationshiptothehouseholdhead,3,3)))]

# family tree lookup tables (household head is gen1)
gen2 <- dt[,
           .(gen2_max = max(gen2_idx, na.rm=TRUE)),
           by=HouseholdID]

gen3 <- dt[!is.na( gen2_idx),
           .(gen3_max = max(gen3_idx, na.rm=TRUE)),
           by=.(HouseholdID, gen2_idx)]

This gives the calculated no. of children born to the head of each household (generation 2), and the number of children (generation 3) born to those of gen 2 who are in the household (or whose spouses are):

> gen2
   HouseholdID gen2_max
         <num>    <int>
1:           1        5
2:           2        2

> gen3
   HouseholdID gen2_idx gen3_max
         <num>    <int>    <int>
1:           1        3        2
2:           1        5        5
3:           2        1       NA
4:           2        2        3
   

Construct the output from these lookups:

# women
out <- dt[Gender == "female",
          .(HouseholdID,
            IndividualID,
            Age,
            Gender,
            Income,
            hh_rel,
            gen2_idx)]

# if mother of HH, count 1 child
out[hh_rel == "D", Numofkids := 1L]

# if HH or wife of HH, highest 2nd gen index
out[hh_rel %in% c("A", "B"), Numofkids := gen2[.SD, on=.(HouseholdID), gen2_max]]

# if daughter or daughter-in-law of HH, highest 3rd gen index among own 2nd gen index
out[hh_rel %in% c("E", "F"), Numofkids := gen3[.SD, on=.(HouseholdID, gen2_idx), gen3_max]]

# otherwise 0
out[is.na(Numofkids), Numofkids := 0L]

# drop cols
out[, let(hh_rel = NULL, gen2_idx = NULL)]

Output:

> out
    HouseholdID IndividualID   Age Gender Income Numofkids
          <num>        <num> <num> <char>  <num>     <int>
 1:           1            2    81 female    120         1
 2:           1            4    59 female    500         5
 3:           1            6    30 female    220         2
 4:           1            7    33 female    170         5
 5:           1            9    20 female    290         0
 6:           1           10    15 female    200         0
 7:           1           11    12 female    100         0
 8:           2            2    55 female    220         2
 9:           2            4    37 female    160         0
10:           2            6    30 female    110         3
11:           2            7    17 female    210         0
12:           2            8    15 female    750         0
13:           2            9    12 female    350         0

Upvotes: 1

Related Questions