Reputation: 621
I have the following type of data:
Person <- c("A", "B", "C", "D", "E", "E", "F", "G", "H", "I")
MOM <- c( NA, NA, NA, "A", "A", NA, "A", "B", "C", NA)
DAD <- c( NA, NA, NA, "B", "B", NA, "E", "A", "B", NA)
Xv <- 1:10
myd <- data.frame (Person, MOM, DAD, Xv, stringsAsFactors=F)
myd
Person MOM DAD Xv
1 A <NA> <NA> 1
2 B <NA> <NA> 2
3 C <NA> <NA> 3
4 D A B 4
5 E A B 5
6 E <NA> <NA> 6
7 F A E 7
8 G B A 8
9 H C B 9
10 I <NA> <NA> 10
These data include each Person and their MOM and DAD columns. I would like to create family groups for these data. NA means the information is missing. A family is defined as the individuals that share a common MOM and DAD. Founders are those that have NA for both parents; they get family = 0.
Here is what I could figure out, which is imperfect for my purpose:
fun <- function(i) {
  i1 <- if (is.na(myd[i, 2])) i else match(myd[i, 2], myd[1:i, 2])
  i2 <- if (is.na(myd[i, 3])) i else match(myd[i, 3], myd[1:i, 3])
  min(i1, i2)
}
myd$family <- as.numeric(factor(sapply(1:nrow(myd), fun)))
Person MOM DAD Xv family
1 A <NA> <NA> 1 1
2 B <NA> <NA> 2 2
3 C <NA> <NA> 3 3
4 D A B 4 4
5 E A B 5 4
6 E <NA> <NA> 6 5
7 F A E 7 4
8 G B A 8 6
9 H C B 9 4
10 I <NA> <NA> 10 7
The above function is imperfect in this sense: the family data do not include the data of the parents. For example, family 4 should include the data for A and B. Thus the complete family would look like:
1 A <NA> <NA> 1 1
2 B <NA> <NA> 2 2
4 D A B 4 4
5 E A B 5 4
Another thing (at least for my purpose): DAD = A and MOM = B is the same as DAD = B and MOM = A. Thus families 4 and 6 are both products of the same parents A and B, so they should be the same family.
4 D A B 4 4
5 E A B 5 4
8 G B A 8 6
Thus expected output is:
Person MOM DAD Xv family
# founders
1 A <NA> <NA> 1 0
2 B <NA> <NA> 2 0
3 C <NA> <NA> 3 0
10 I <NA> <NA> 10 0
6 E <NA> <NA> 6 0
# Family 1
1 A <NA> <NA> 1 1
2 B <NA> <NA> 2 1
4 D A B 4 1
5 E A B 5 1
8 G B A 8 1
# Family 2
1 A <NA> <NA> 1 2
6 E <NA> <NA> 6 2
7 F A E 7 2
# Family 3
2 B <NA> <NA> 2 3
3 C <NA> <NA> 3 3
9 H C B 9 3
Edits:
It is a pity (or a good thing!) that in human genetics we all need to work with similar variables: family, trio, mom (parent1, mother, female), father (dad, parent2, male), individual / subject, etc. This makes the data similar, and the issues are similar too.
Family vs Trio
1 Nuclear family
A x B
|
C D E
Trio -> 3 trios
A x B A x B A x B
| | |
C D E
Edits from the questioner: If you agree with the comments below that this is homework, please do not answer the question for some time (however long you think is enough for the homework submission deadline to have passed). If I get an answer I will post it later (in 3 months or so).
Edits
Founders definition - those who have both parents unknown, whether or not they have any sons / daughters, so they have NA in both the MOM and DAD columns. These are considered family 0: they are part of other families, but the list itself is not a real family.
Person MOM DAD Xv family
1 A <NA> <NA> 1 0
2 B <NA> <NA> 2 0
3 C <NA> <NA> 3 0
10 I <NA> <NA> 10 0
6 E <NA> <NA> 6 0
Family definition - A family consists of parents (MOM and DAD) and all their sons and daughters. If one Person's DAD and MOM match another Person's DAD and MOM, they should be considered a family. For example, persons D and E in the following list have MOM = A and DAD = B; these two individuals together with A and B constitute a family. Now we need to recycle the data for their parents (A and B) from the founders list (family 0).
# Family 1
Person MOM DAD Xv family
1 A <NA> <NA> 1 1
2 B <NA> <NA> 2 1
4 D A B 4 1
5 E A B 5 1
Also, contrary to the human situation, here an individual can be either MOM or DAD (can switch sex), so progeny produced by A (MOM) and B (DAD) are the same as progeny produced by B (MOM) and A (DAD). Thus we need to add the following individual to the family 1 list.
Person MOM DAD Xv family
8 G B A 8 1
Thus complete list for family 1 becomes:
Person MOM DAD Xv family
1 A <NA> <NA> 1 1
2 B <NA> <NA> 2 1
4 D A B 4 1
5 E A B 5 1
8 G B A 8 1
The family 1 can be diagrammatically sketched as:
MOM x DAD MOM x DAD
A | B or B | A
----------------- ------
| | |
D E G
Here is a partial solution:
myd1 <- data.frame(myd$DAD, myd$MOM)
myd$family<-as.factor(apply(myd1,1,function(x){paste(x[order(x)],collapse='-')}))
Person MOM DAD Xv family
1 A <NA> <NA> 1 NA-NA
2 B <NA> <NA> 2 NA-NA
3 C <NA> <NA> 3 NA-NA
4 D A B 4 A-B
5 E A B 5 A-B
6 E <NA> <NA> 6 NA-NA
7 F A E 7 A-E
8 G B A 8 A-B
9 H C B 9 B-C
10 I <NA> <NA> 10 NA-NA
It does not give a family number but rather a label for the parent pair, such as A-B. NA-NA marks the founders, and because the parents are ordered before collapsing, B-A becomes A-B.
The remaining issue is that the A-B family needs the data from Persons A and B recycled (although they are in the NA-NA family group).
Person MOM DAD Xv family
1 A <NA> <NA> 1 NA-NA
2 B <NA> <NA> 2 NA-NA
4 D A B 4 A-B
5 E A B 5 A-B
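One possible way to turn the parent-pair label into the numeric family column described in the question is sketched below. It is only a sketch under stated assumptions: every person has either both parents known or both unknown (as in the example data), and the names founder and key are introduced here purely for illustration. Families are numbered in order of first appearance and founders get 0.

```r
# Example data copied from the question
myd <- data.frame(
  Person = c("A", "B", "C", "D", "E", "E", "F", "G", "H", "I"),
  MOM    = c(NA, NA, NA, "A", "A", NA, "A", "B", "C", NA),
  DAD    = c(NA, NA, NA, "B", "B", NA, "E", "A", "B", NA),
  Xv     = 1:10,
  stringsAsFactors = FALSE
)

# Founders: both parents unknown (assumption: never exactly one NA parent)
founder <- is.na(myd$MOM) & is.na(myd$DAD)

# Order-independent parent key, so (A, B) and (B, A) give the same label
key <- ifelse(founder, NA,
              paste(pmin(myd$MOM, myd$DAD), pmax(myd$MOM, myd$DAD), sep = "-"))

# Number the families in order of first appearance; founders become 0
myd$family <- match(key, unique(key[!founder]))
myd$family[founder] <- 0L
myd
```

This only labels each row; it does not yet copy the parents' rows into their children's family, which is the recycling step still asked for.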
Upvotes: 2
Views: 514
Reputation: 193527
I'm not sure if you've figured this out yet, but here is one solution.
First, your data:
# Your data
myd <- data.frame(Person = c("A", "B", "C", "D", "E",
"E", "F", "G", "H", "I"),
MOM = c(NA, NA, NA, "A", "A", NA, "A", "B", "C", NA),
DAD = c(NA, NA, NA, "B", "B", NA, "E", "A", "B", NA),
Xv = 1:10, stringsAsFactors=F)
Second, we identify the families by merging together columns 2 and 3 from your original data. We will use this to split your data.frame into a list.
# Identifying the families
fam = apply(myd[2:3], 1, function(x) paste0(sort(x), collapse=" "))
Third, we split the data.frame into a list. In this case, we end up with a list of four data.frames: one for the founders, and one for each family.
# Splitting the data by founders and families
temp_1 = split(myd, fam)
names(temp_1)[1] = "Founders"
Fourth, we do some simple matching and subsetting to identify which founders belong to which families.
# Identify which families the founders belong to
temp_2 = lapply(1:length(temp_1),
function(x) temp_1[[1]][which(temp_1[[1]]$Person %in%
unique(unlist(temp_1[[x]][,c(2,3)], use.names=FALSE))),])
And, finally, we rbind this data together.
# "Merging" (with rbind) founders and their families
OUT = lapply(1:length(temp_1), function(x) rbind(temp_2[[x]], temp_1[[x]]))
names(OUT) = names(temp_1)
This is the output:
OUT
# $Founders
# Person MOM DAD Xv
# 1 A <NA> <NA> 1
# 2 B <NA> <NA> 2
# 3 C <NA> <NA> 3
# 6 E <NA> <NA> 6
# 10 I <NA> <NA> 10
#
# $`A B`
# Person MOM DAD Xv
# 1 A <NA> <NA> 1
# 2 B <NA> <NA> 2
# 4 D A B 4
# 5 E A B 5
# 8 G B A 8
#
# $`A E`
# Person MOM DAD Xv
# 1 A <NA> <NA> 1
# 6 E <NA> <NA> 6
# 7 F A E 7
#
# $`B C`
# Person MOM DAD Xv
# 2 B <NA> <NA> 2
# 3 C <NA> <NA> 3
# 9 H C B 9
If you prefer a data.frame to a list, you can do the following after completing the previous steps:
OUT = do.call("rbind",
lapply(1:length(OUT),
function(x) cbind(OUT[[x]], fam = names(OUT[x]))))
OUT
# Person MOM DAD Xv fam
# 1 A <NA> <NA> 1 Founders
# 2 B <NA> <NA> 2 Founders
# 3 C <NA> <NA> 3 Founders
# 6 E <NA> <NA> 6 Founders
# 10 I <NA> <NA> 10 Founders
# 11 A <NA> <NA> 1 A B
# 21 B <NA> <NA> 2 A B
# 4 D A B 4 A B
# 5 E A B 5 A B
# 8 G B A 8 A B
# 12 A <NA> <NA> 1 A E
# 61 E <NA> <NA> 6 A E
# 7 F A E 7 A E
# 22 B <NA> <NA> 2 B C
# 31 C <NA> <NA> 3 B C
# 9 H C B 9 B C
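If the numeric family codes from the question are wanted (0 for founders, then 1, 2, 3), the steps above could be condensed and extended roughly as follows. This is a hedged sketch rather than a definitive implementation: the names groups, founders, with_parents and out are made up for this example, and it assumes that every parent appears among the founders, as in the sample data.

```r
# Example data copied from the question
myd <- data.frame(
  Person = c("A", "B", "C", "D", "E", "E", "F", "G", "H", "I"),
  MOM    = c(NA, NA, NA, "A", "A", NA, "A", "B", "C", NA),
  DAD    = c(NA, NA, NA, "B", "B", NA, "E", "A", "B", NA),
  Xv     = 1:10,
  stringsAsFactors = FALSE
)

# Order-independent family key; sort() drops NA, so founders get ""
fam <- apply(myd[c("MOM", "DAD")], 1,
             function(x) paste(sort(x), collapse = " "))

# One data.frame per key; the "" group holds the founders
groups <- split(myd, fam)
names(groups)[names(groups) == ""] <- "Founders"
founders <- groups[["Founders"]]

# Prepend the parents' founder rows to each family
with_parents <- lapply(groups, function(g) {
  parents <- founders[founders$Person %in% unlist(g[c("MOM", "DAD")]), ]
  rbind(parents, g)
})

# Stack everything and add a numeric family code (Founders = 0)
out <- do.call(rbind, Map(function(g, nm) cbind(g, fam = nm),
                          with_parents, names(with_parents)))
out$family <- match(out$fam, unique(out$fam)) - 1L
out
```

Using match() against unique() numbers the families in the order in which they appear in the stacked result, so the founders, which come first, receive 0.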
Upvotes: 3
Reputation: 263352
If you want a character vector that is the same for each "family" then using the interaction function would be more compact. Something along these lines:
myd$fam <- with( myd, as.character( interaction(MOM,DAD)))
myd$fam[ is.na(myd$fam) ] <- 0
If you want numbers (which seems unwise, but that is how you offered your request) then instead of as.character, use as.numeric:
myd$fam <- with( myd, as.numeric( interaction(MOM,DAD)))
myd$fam[ is.na(myd$fam) ] <- 0
I never figured out how you could have "A" represent both a MOM and a DAD. You may need to work on explaining how you understand that aspect of human genetics. For the splitting by family, use split:
> split(myd, myd$fam)
$`0`
Person MOM DAD Xv fam
1 A <NA> <NA> 1 0
2 B <NA> <NA> 2 0
3 C <NA> <NA> 3 0
6 E <NA> <NA> 6 0
10 I <NA> <NA> 10 0
$`2`
Person MOM DAD Xv fam
8 G B A 8 2
$`4`
Person MOM DAD Xv fam
4 D A B 4 4
5 E A B 5 4
$`6`
Person MOM DAD Xv fam
9 H C B 9 6
$`7`
Person MOM DAD Xv fam
7 F A E 7 7
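The split above keeps B.A (group 2) and A.B (group 4) apart. If parent order should not matter, as the question states, one possible variation is to sort each parent pair with pmin and pmax before calling interaction. This is only a sketch: p1 and p2 are names introduced here, drop = TRUE is used so that unused level combinations are not numbered, and it assumes both parents are either known or both NA.

```r
# Example data copied from the question
myd <- data.frame(
  Person = c("A", "B", "C", "D", "E", "E", "F", "G", "H", "I"),
  MOM    = c(NA, NA, NA, "A", "A", NA, "A", "B", "C", NA),
  DAD    = c(NA, NA, NA, "B", "B", NA, "E", "A", "B", NA),
  Xv     = 1:10,
  stringsAsFactors = FALSE
)

# Sort each parent pair so that (A, B) and (B, A) give the same key
p1 <- pmin(myd$MOM, myd$DAD)
p2 <- pmax(myd$MOM, myd$DAD)

# Integer code per parent pair; founders (NA) are recoded to 0
myd$fam <- as.integer(interaction(p1, p2, drop = TRUE))
myd$fam[is.na(myd$fam)] <- 0L

split(myd, myd$fam)
```

With this change D, E and G fall into the same group, although the particular integer assigned to each family depends on the factor level order rather than on order of appearance.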
Upvotes: 2