Reputation: 177
My data looks something like this:
ID CSEX MID CMOB CYRB 1ST 2ND
1 1 1 1 1991 0 1
2 1 1 7 1989 1 0
3 2 2 1 1985 1 0
4 2 2 11 1985 0 1
5 1 2 9 1994 0 0
6 2 3 4 1992 1 0
7 2 4 2 1992 0 1
8 1 4 10 1983 1 0
With ID = child ID, CSEX = child sex, MID = mother ID, CMOB = month of birth and CYRB = year of birth, 1st = first born dummy, 2nd = second born dummy.
And I'm trying to make a dummy variable that takes the value 1 if the first two children born into a family (i.e. with the same MID) are the same sex.
I tried
Identifiers_age <- Identifiers_age %>% group_by(MPUBID) %>%
mutate(samesex =
as.numeric(((first == 1 & CSEX == 1) & (second == 1 & CSEX == 1))
| (first == 1 & CSEX == 2) & (second == 1 & CSEX ==2))))
But clearly this still only check the condition for each individual ID rather than by MID so returns a dummy which always takes value = 0.
Thanks
Edit for expected output:
ID CSEX MID CMOB CYRB 1ST 2ND SAMESEX
1 1 1 1 1991 0 1 1
2 1 1 7 1989 1 0 1
3 2 2 1 1985 1 0 1
4 2 2 11 1985 0 1 1
5 1 2 9 1994 0 0 1
6 2 3 4 1992 1 0 0
7 2 4 2 1992 0 1 0
8 1 4 10 1983 1 0 0
i.e. for any individual that is in a family where the first two children born are of the same sex, the dummy SAMESEX = 1
Edit2 (What I showed before was just an example I made, for the true dataset calling structure gives):
CPUBID MPUBID CSEX CMOB CYRB first second
<int> <int> <int> <int> <int> <dbl> <dbl>
1 201 2 2 3 1993 1 0
2 202 2 2 11 1994 0 1
3 301 3 2 6 1981 1 0
4 302 3 2 10 1983 0 1
5 303 3 2 4 1986 0 0
6 401 4 1 8 1980 1 0
7 403 4 2 3 1997 0 1
8 801 8 2 3 1976 1 0
9 802 8 1 5 1979 0 1
10 803 8 2 9 1982 0 0
and str:
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 11512 obs. of 7 variables:
$ CPUBID : int 201 202 301 302 303 401 403 801 802 803 ...
$ MPUBID : int 2 2 3 3 3 4 4 8 8 8 ...
$ CSEX : int 2 2 2 2 2 1 2 2 1 2 ...
$ CMOB : int 3 11 6 10 4 8 3 3 5 9 ...
$ CYRB : int 1993 1994 1981 1983 1986 1980 1997 1976 1979 1982 ...
$ first : num 1 0 1 0 0 1 0 1 0 0 ...
$ second : num 0 1 0 1 0 0 1 0 1 0 ...
Upvotes: 2
Views: 874
Reputation: 886938
May be this helps
library(dplyr)
Identifiers_age %>%
group_by(MID) %>%
mutate(ind1 = CSEX *`1ST`,
ind2 = CSEX *`2ND`,
SAMESEX = as.integer(n_distinct(c(ind1[ind1!=0],
ind2[ind2!=0]))==1 & sum(ind1) >0 & sum(ind2) > 0)) %>%
select(-ind1, -ind2)
# ID CSEX MID CMOB CYRB 1ST 2ND SAMESEX
# <int> <int> <int> <int> <int> <int> <int> <int>
#1 1 1 1 1 1991 0 1 1
#2 2 1 1 7 1989 1 0 1
#3 3 2 2 1 1985 1 0 1
#4 4 2 2 11 1985 0 1 1
#5 5 1 2 9 1994 0 0 1
#6 6 2 3 4 1992 1 0 0
#7 7 2 4 2 1992 0 1 0
#8 8 1 4 10 1983 1 0 0
Or it can be made slightly compact with
Identifiers_age %>%
group_by(MID) %>%
mutate(SAMESEX = as.integer(n_distinct(c(CSEX * NA^!`1ST`, CSEX * NA^!`2ND`),
na.rm = TRUE)==1 & sum(`1ST`) > 0 & sum(`2ND`) > 0))
Identifiers_age <- structure(list(ID = 1:8, CSEX = c(1L, 1L, 2L, 2L, 1L,
2L, 2L,
1L), MID = c(1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L), CMOB = c(1L, 7L,
1L, 11L, 9L, 4L, 2L, 10L), CYRB = c(1991L, 1989L, 1985L, 1985L,
1994L, 1992L, 1992L, 1983L), `1ST` = c(0L, 1L, 1L, 0L, 0L, 1L,
0L, 1L), `2ND` = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L)), .Names = c("ID",
"CSEX", "MID", "CMOB", "CYRB", "1ST", "2ND"), class = "data.frame",
row.names = c(NA, -8L))
Upvotes: 2