Reputation: 4299
I am trying to fill missing cases with information based a related individual (couple).
My data looks like this
hserial sex age children
1 1001041 Male 30 Yes
2 1001041 Female 32 Yes
3 1001061 Male 22 No
4 1001061 Female 21 No
5 1001091 Male 38 Yes
6 1001091 Female 37 Yes
7 1001151 Male 31 No
8 1001151 Female 27 Not eligible
9 1001161 Male 33 Yes
10 1001161 Female 35 Yes
So hserial
is the couple identifier. Row number 8 has a missing case Not eligible
but the information is available from the partner (row number 7).
I am trying to find a neat way to fill these missing with partner's info.
I was thinking doing something like
library(dplyr)
childsum = dta %>% group_by(hserial, sex, children) %>%
summarise(n = n()) %>% spread(sex, children)
I will get
hserial n Male Female
1 1001041 1 Yes Yes
2 1001061 1 No No
3 1001091 1 Yes Yes
4 1001151 1 No Not eligible
5 1001161 1 Yes Yes
Then I could do something like
childsum$Male = ifelse(childsum$Male == 'Not eligible', childsum$Female, childsum$Male)
childsum$Female = ifelse(childsum$Female == 'Not eligible', childsum$Male, childsum$Female)
So for every missing of Male
fill with Female
info and vice versa.
Then merge back the results in order to get
hserial sex age children
1 1001041 Male 30 Yes
2 1001041 Female 32 Yes
3 1001061 Male 22 No
4 1001061 Female 21 No
5 1001091 Male 38 Yes
6 1001091 Female 37 Yes
7 1001151 Male 31 No
8 1001151 Female 27 No
9 1001161 Male 33 Yes
10 1001161 Female 35 Yes
Any idea how to do this is a neat way ?
dta = structure(list(hserial = c(1001041, 1001041, 1001061, 1001061,
1001091, 1001091, 1001151, 1001151, 1001161, 1001161), sex = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Male", "Female"
), class = "factor"), age = c(30, 32, 22, 21, 38, 37, 31, 27,
33, 35), children = structure(c(5L, 5L, 6L, 6L, 5L, 5L, 6L, 4L,
5L, 5L), .Label = c("DNA Does not apply", "NA No answer", "NA No answer",
"Not eligible", "Yes", "No"), class = "factor")), class = "data.frame", .Names = c("hserial",
"sex", "age", "children"), row.names = c(NA, -10L))
Upvotes: 3
Views: 73
Reputation: 70266
Here's an approach which assumes that any couple (consisting of two hserial
s) should always have the same yes/no entry in children
unless both persons have Not eligible
entries. Therefore, it computes per couple the setdiff
of available children
info and Not eligible
. In cases where all (both) entries are "Not eligible", it returns NA
, since I think that's a better way to handle missing values (as you know there are many specialized functions you can use with NA
s that you cannot use the same way for Not eligible
entries).
dta %>%
group_by(hserial) %>%
mutate(children = if(all(children == "Not eligible")) NA_character_ else
setdiff(children, "Not eligible"))
#Source: local data frame [10 x 4]
#Groups: hserial [5]
#
# hserial sex age children
# (dbl) (fctr) (dbl) (chr)
#1 1001041 Male 30 Yes
#2 1001041 Female 32 Yes
#3 1001061 Male 22 No
#4 1001061 Female 21 No
#5 1001091 Male 38 Yes
#6 1001091 Female 37 Yes
#7 1001151 Male 31 No
#8 1001151 Female 27 No
#9 1001161 Male 33 Yes
#10 1001161 Female 35 Yes
Upvotes: 3