giac
giac

Reputation: 4299

R - fill missing information based on paired data

I am trying to fill missing cases with information based a related individual (couple).

My data looks like this

   hserial    sex age     children
1  1001041   Male  30          Yes
2  1001041 Female  32          Yes
3  1001061   Male  22           No
4  1001061 Female  21           No
5  1001091   Male  38          Yes
6  1001091 Female  37          Yes
7  1001151   Male  31           No
8  1001151 Female  27 Not eligible
9  1001161   Male  33          Yes
10 1001161 Female  35          Yes

So hserial is the couple identifier. Row number 8 has a missing case Not eligible but the information is available from the partner (row number 7).

I am trying to find a neat way to fill these missing with partner's info.

I was thinking doing something like

library(dplyr) 

childsum = dta %>% group_by(hserial, sex, children) %>% 
summarise(n = n()) %>% spread(sex, children) 

I will get

  hserial n Male       Female
1 1001041 1  Yes          Yes
2 1001061 1   No           No
3 1001091 1  Yes          Yes
4 1001151 1   No Not eligible
5 1001161 1  Yes          Yes

Then I could do something like

childsum$Male = ifelse(childsum$Male == 'Not eligible', childsum$Female, childsum$Male)
childsum$Female = ifelse(childsum$Female == 'Not eligible', childsum$Male, childsum$Female)

So for every missing of Male fill with Female info and vice versa. Then merge back the results in order to get

   hserial    sex age     children
1  1001041   Male  30          Yes
2  1001041 Female  32          Yes
3  1001061   Male  22           No
4  1001061 Female  21           No
5  1001091   Male  38          Yes
6  1001091 Female  37          Yes
7  1001151   Male  31           No
8  1001151 Female  27           No
9  1001161   Male  33          Yes
10 1001161 Female  35          Yes

Any idea how to do this is a neat way ?

dta = structure(list(hserial = c(1001041, 1001041, 1001061, 1001061, 
1001091, 1001091, 1001151, 1001151, 1001161, 1001161), sex = structure(c(1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Male", "Female"
), class = "factor"), age = c(30, 32, 22, 21, 38, 37, 31, 27, 
33, 35), children = structure(c(5L, 5L, 6L, 6L, 5L, 5L, 6L, 4L, 
5L, 5L), .Label = c("DNA Does not apply", "NA No answer", "NA No answer", 
"Not eligible", "Yes", "No"), class = "factor")), class = "data.frame", .Names = c("hserial", 
"sex", "age", "children"), row.names = c(NA, -10L))

Upvotes: 3

Views: 73

Answers (1)

talat
talat

Reputation: 70266

Here's an approach which assumes that any couple (consisting of two hserials) should always have the same yes/no entry in children unless both persons have Not eligible entries. Therefore, it computes per couple the setdiff of available children info and Not eligible. In cases where all (both) entries are "Not eligible", it returns NA, since I think that's a better way to handle missing values (as you know there are many specialized functions you can use with NAs that you cannot use the same way for Not eligible entries).

dta %>% 
  group_by(hserial) %>% 
  mutate(children = if(all(children == "Not eligible")) NA_character_ else 
                       setdiff(children, "Not eligible"))
#Source: local data frame [10 x 4]
#Groups: hserial [5]
#
#   hserial    sex   age children
#     (dbl) (fctr) (dbl)    (chr)
#1  1001041   Male    30      Yes
#2  1001041 Female    32      Yes
#3  1001061   Male    22       No
#4  1001061 Female    21       No
#5  1001091   Male    38      Yes
#6  1001091 Female    37      Yes
#7  1001151   Male    31       No
#8  1001151 Female    27       No
#9  1001161   Male    33      Yes
#10 1001161 Female    35      Yes

Upvotes: 3

Related Questions