generate group (family) in R

Question

I have the following type of data:

Person <- c("A",  "B", "C",  "D",  "E",  "E",  "F",  "G", "H", "I")
MOM <- c(   NA,   NA,   NA,  "A",  "A",   NA,  "A",  "B", "C", NA)
DAD <- c(   NA,   NA,   NA,  "B",  "B",   NA,  "E",  "A", "B", NA)
Xv <- 1:10
myd <- data.frame(Person, MOM, DAD, Xv, stringsAsFactors = FALSE)
myd 
   Person  MOM  DAD Xv
1       A <NA> <NA>  1
2       B <NA> <NA>  2
3       C <NA> <NA>  3
4       D    A    B  4
5       E    A    B  5
6       E <NA> <NA>  6
7       F    A    E  7
8       G    B    A  8
9       H    C    B  9
10      I <NA> <NA> 10

The data have a Person column plus MOM and DAD columns naming each person's parents; NA means the parent is unknown. I would like to assign a family group to each row. A family is a set of individuals that share the same MOM and DAD. Founders are individuals with NA for both parents; they get family = 0.

Here is what I have so far, which is not quite right:

fun <- function(i) {
  i1 <- if (is.na(myd[i, 2])) i else match(myd[i, 2], myd[1:i, 2])
  i2 <- if (is.na(myd[i, 3])) i else match(myd[i, 3], myd[1:i, 3])
  min(i1, i2)
}
myd$family <- as.numeric(factor(sapply(1:nrow(myd), fun)))
   Person  MOM  DAD Xv family
1       A <NA> <NA>  1      1
2       B <NA> <NA>  2      2
3       C <NA> <NA>  3      3
4       D    A    B  4      4
5       E    A    B  5      4
6       E <NA> <NA>  6      5
7       F    A    E  7      4
8       G    B    A  8      6
9       H    C    B  9      4
10      I <NA> <NA> 10      7

The above function is imperfect in this sense: a family does not include the rows of its parents. For example, family 4 should include the data for A and B. Thus the complete family would look like:

1       A <NA> <NA>  1      1
2       B <NA> <NA>  2      2
4       D    A    B  4      4
5       E    A    B  5      4

Another thing (at least for my purpose): DAD = A and MOM = B is the same as DAD = B and MOM = A. Thus families 4 and 6 both come from parents A and B, so they should be the same family.

4       D    A    B  4      4
5       E    A    B  5      4
8       G    B    A  8      6
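
One possible way to make the parent order irrelevant is to build the pair key from the element-wise minimum and maximum of the two columns. This is only a sketch: `pair_key` is a name introduced here for illustration, and it assumes that a row is a founder exactly when both parents are NA, as in the sample data.

```r
MOM <- c(NA, NA, NA, "A", "A", NA, "A", "B", "C", NA)
DAD <- c(NA, NA, NA, "B", "B", NA, "E", "A", "B", NA)

# pmin()/pmax() compare character vectors element by element, so the
# alphabetically smaller parent always comes first in the key.
is_founder <- is.na(MOM) & is.na(DAD)
pair_key <- ifelse(is_founder, NA_character_,
                   paste(pmin(MOM, DAD), pmax(MOM, DAD), sep = "-"))

pair_key[c(4, 8)]  # rows D (A x B) and G (B x A) share the key "A-B"
```

Rows with exactly one known parent do not occur in the sample data; how they should be keyed is left open here.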

Thus expected output is:

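   Person  MOM  DAD Xv family
# founders
1       A <NA> <NA>  1      0
2       B <NA> <NA>  2      0
3       C <NA> <NA>  3      0
10      I <NA> <NA> 10      0
6       E <NA> <NA>  6      0
# Family 1
1       A <NA> <NA>  1      1
2       B <NA> <NA>  2      1
4       D    A    B  4      1
5       E    A    B  5      1
8       G    B    A  8      1
# Family 2
1       A <NA> <NA>  1      2
6       E <NA> <NA>  6      2
7       F    A    E  7      2
# Family 3
2       B <NA> <NA>  2      3
3       C <NA> <NA>  3      3
9       H    C    B  9      3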

Edits:

It is a pity (or good!) that in human genetics we keep working with the same kinds of variables: family, trio, mom (parent1, mother, female), father (dad, parent2, male), individual / subject, etc. This makes the data sets look alike, and the issues are alike too.

  Family vs Trio 
  1 Nuclear family 
  A  x   B
      |
   C   D  E

  Trio -> 3 trios  
  A x B      A x B       A x B
     |         |            |
     C          D           E

Edits from the questioner: If you agree with the comments below that this is homework, please do not answer the question for some time (however long you think is enough for the homework submission deadline to have passed). If I find an answer I will post it later (in 3 months or so).

Edits

Founders definition: those who have both parents unknown, whether or not they have any sons or daughters, so they have NA in both the MOM and DAD columns. They are labelled family 0: they may be members of other families, but the founders list itself is not a real family.

   Person  MOM  DAD Xv family
1       A <NA> <NA>  1      0
2       B <NA> <NA>  2      0
3       C <NA> <NA>  3      0
10      I <NA> <NA> 10      0
6       E <NA> <NA>  6      0

Family definition: a family consists of the parents (MOM and DAD) and all of their sons and daughters. If one Person's MOM and DAD match another Person's MOM and DAD, they belong to the same family. For example, persons D and E in the following list both have MOM = A and DAD = B, so these two individuals together with A and B form a family. The data for their parents (A and B) therefore need to be recycled from the founders list (family 0).

# Family 1
  Person  MOM  DAD Xv family
1      A <NA> <NA>  1      1
2      B <NA> <NA>  2      1
4      D    A    B  4      1
5      E    A    B  5      1

Also, in contrast to the human situation, here an individual can be either MOM or DAD (it can switch sex), so progeny produced by A (MOM) and B (DAD) are the same as progeny produced by B (MOM) and A (DAD). Thus we need to add the following individual to the family 1 list.

       Person  MOM   DAD     Xv     family
   8       G       B    A       8      1

Thus complete list for family 1 becomes:

  Person  MOM  DAD Xv family
1      A <NA> <NA>  1      1
2      B <NA> <NA>  2      1
4      D    A    B  4      1
5      E    A    B  5      1
8      G    B    A  8      1

The family 1 can be diagrammatically sketched as:

            MOM   x   DAD             MOM   x   DAD
              A  |   B        or       B  |     A 
            -----------------          ------
           |                 |           |
           D                 E           G

Here is a partial solution:

myd1 <- data.frame(myd$DAD, myd$MOM) 
myd$family<-as.factor(apply(myd1,1,function(x){paste(x[order(x)],collapse='-')}))
   Person  MOM  DAD Xv family
1       A <NA> <NA>  1  NA-NA
2       B <NA> <NA>  2  NA-NA
3       C <NA> <NA>  3  NA-NA
4       D    A    B  4    A-B
5       E    A    B  5    A-B
6       E <NA> <NA>  6  NA-NA
7       F    A    E  7    A-E
8       G    B    A  8    A-B
9       H    C    B  9    B-C
10      I <NA> <NA> 10  NA-NA

It does not give a family number but rather a label such as A-B. NA-NA marks the founders, and because the parents are ordered before collapsing, B-A becomes A-B.
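
If numeric family codes are wanted instead of labels, one option is to match each label against the distinct non-founder labels in order of first appearance. This is a hedged sketch rather than a finished solution: `lab`, `real_families` and `family` are names chosen here, and it assumes no person is literally named "NA", since that would collide with the founder label.

```r
MOM <- c(NA, NA, NA, "A", "A", NA, "A", "B", "C", NA)
DAD <- c(NA, NA, NA, "B", "B", NA, "E", "A", "B", NA)

# Same labels as in the partial solution: parents ordered, then collapsed.
lab <- apply(cbind(DAD, MOM), 1,
             function(x) paste(x[order(x)], collapse = "-"))

# Distinct real families, in order of first appearance.
real_families <- setdiff(unique(lab), "NA-NA")

# Founders match nothing and therefore get code 0.
family <- match(lab, real_families, nomatch = 0L)
family
# 0 0 0 1 1 0 2 1 3 0
```

This numbers the families but still does not copy the parents' rows into each family.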

The remaining issue is that the A-B family needs the data for Persons A and B recycled (although they are in the NA-NA group).

  Person  MOM  DAD Xv family
1      A <NA> <NA>  1  NA-NA
2      B <NA> <NA>  2  NA-NA
4      D    A    B  4    A-B
5      E    A    B  5    A-B

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer

I'm not sure if you've figured this out yet, but here is one solution.

First, your data:

# Your data
myd <- data.frame(Person = c("A", "B", "C", "D", "E", 
                             "E", "F", "G", "H", "I"),
                  MOM = c(NA, NA, NA, "A", "A", NA, "A", "B", "C", NA),
                  DAD = c(NA, NA, NA, "B", "B", NA, "E", "A", "B", NA),
                  Xv = 1:10, stringsAsFactors=F)

Second, we identify the families by pasting together the sorted values of columns 2 and 3 (MOM and DAD) from your original data, so that the order of the parents does not matter. We will use this key to split your data.frame into a list.

# Identifying the families
fam = apply(myd[2:3], 1, function(x) paste0(sort(x), collapse=" "))

Third, we split the data.frame into a list. In this case, we end up with a list of four data.frames: one for the founders, and one for each family.

# Splitting the data by founders and families
temp_1 = split(myd, fam)
names(temp_1)[1] = "Founders"
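
Renaming the first element works because `sort()` drops NA values, so every founder row gets the empty string as its key, and the empty string sorts before the other keys. The sketch below checks this on the sample data and shows a slightly more defensive rename by key instead of by position; it is an illustration, not part of the original answer.

```r
myd <- data.frame(Person = c("A", "B", "C", "D", "E",
                             "E", "F", "G", "H", "I"),
                  MOM = c(NA, NA, NA, "A", "A", NA, "A", "B", "C", NA),
                  DAD = c(NA, NA, NA, "B", "B", NA, "E", "A", "B", NA),
                  Xv = 1:10, stringsAsFactors = FALSE)

fam <- apply(myd[2:3], 1, function(x) paste0(sort(x), collapse = " "))
sum(fam == "")  # number of founder rows, all keyed by ""

temp_1 <- split(myd, fam)
# Rename by matching the empty key rather than assuming it is element 1.
names(temp_1)[names(temp_1) == ""] <- "Founders"
names(temp_1)
```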

Fourth, we do some simple matching and subsetting to identify which founders belong to which families.

# Identify which families the founders belong to
temp_2 = lapply(1:length(temp_1),
                function(x) temp_1[[1]][which(temp_1[[1]]$Person %in% 
                  unique(unlist(temp_1[[x]][,c(2,3)], use.names=FALSE))),])

And, finally, we rbind this data together.

# "Merging" (with rbind) founders and their families
OUT = lapply(1:length(temp_1), function(x) rbind(temp_2[[x]], temp_1[[x]]))
names(OUT) = names(temp_1)

This is the output:

OUT
# $Founders
#    Person  MOM  DAD Xv
# 1       A <NA> <NA>  1
# 2       B <NA> <NA>  2
# 3       C <NA> <NA>  3
# 6       E <NA> <NA>  6
# 10      I <NA> <NA> 10
# 
# $`A B`
#   Person  MOM  DAD Xv
# 1      A <NA> <NA>  1
# 2      B <NA> <NA>  2
# 4      D    A    B  4
# 5      E    A    B  5
# 8      G    B    A  8
# 
# $`A E`
#   Person  MOM  DAD Xv
# 1      A <NA> <NA>  1
# 6      E <NA> <NA>  6
# 7      F    A    E  7
# 
# $`B C`
#   Person  MOM  DAD Xv
# 2      B <NA> <NA>  2
# 3      C <NA> <NA>  3
# 9      H    C    B  9

Update: data.frame output

If you prefer a data.frame to a list, you can do the following after completing the previous steps:

OUT = do.call("rbind", 
              lapply(1:length(OUT), 
                     function(x) cbind(OUT[[x]], fam = names(OUT[x]))))
OUT
#    Person  MOM  DAD Xv      fam
# 1       A <NA> <NA>  1 Founders
# 2       B <NA> <NA>  2 Founders
# 3       C <NA> <NA>  3 Founders
# 6       E <NA> <NA>  6 Founders
# 10      I <NA> <NA> 10 Founders
# 11      A <NA> <NA>  1      A B
# 21      B <NA> <NA>  2      A B
# 4       D    A    B  4      A B
# 5       E    A    B  5      A B
# 8       G    B    A  8      A B
# 12      A <NA> <NA>  1      A E
# 61      E <NA> <NA>  6      A E
# 7       F    A    E  7      A E
# 22      B <NA> <NA>  2      B C
# 31      C <NA> <NA>  3      B C
# 9       H    C    B  9      B C
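
The question asked for numeric family codes (0 for founders, then 1, 2, 3). A compact variant that produces such a column directly might look like the following. It is a sketch under the same assumptions as above (founders have both parents NA, and every parent appears as a founder row); the names `is_founder`, `key`, `blocks` and `out` are introduced here for illustration.

```r
myd <- data.frame(Person = c("A", "B", "C", "D", "E",
                             "E", "F", "G", "H", "I"),
                  MOM = c(NA, NA, NA, "A", "A", NA, "A", "B", "C", NA),
                  DAD = c(NA, NA, NA, "B", "B", NA, "E", "A", "B", NA),
                  Xv = 1:10, stringsAsFactors = FALSE)

is_founder <- is.na(myd$MOM) & is.na(myd$DAD)
# Order-independent key for each parent pair; NA for founders.
key <- ifelse(is_founder, NA_character_,
              paste(pmin(myd$MOM, myd$DAD), pmax(myd$MOM, myd$DAD)))
keys <- unique(key[!is_founder])   # families in order of first appearance
founders <- myd[is_founder, ]

# For each family: the parents' founder rows followed by the children.
blocks <- lapply(seq_along(keys), function(k) {
  kids <- myd[!is_founder & key == keys[k], ]
  parents <- founders[founders$Person %in% c(kids$MOM, kids$DAD), ]
  cbind(rbind(parents, kids), family = k)
})

out <- rbind(cbind(founders, family = 0L), do.call(rbind, blocks))
table(out$family)  # 5 founder rows, then families of 5, 3 and 3 rows
```

Because parent rows are copied into each family they belong to, `rbind` has to make the repeated row names unique, just as in the output above.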
