Find pattern of combination by group in R

Question

I have data frame, my goal is finding the pattern of combination var1 by ID, if at least 3 categories the same for each group, we set "Yes", and then which ID have the same combination.

ID1: I have 4 unique categories (A,B,C,D)
ID2: I have 4 unique categories (B,C,D,F)
ID3: I have 3 unique categories (A,B,C)
ID4: I have 2 unique categories (A,B)
ID5: I have 4 unique categories (C,D,F)

We can see ID1, ID2 have at least 3 categories the same (B,C,D), ID1 and ID3 have (A,B,C),and ID2, ID5 have at least 3 the same (C,D,F). So there are 4 ID will have "Yes" only ID4=="No".

ID <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,5,5,5,5,5)
var1 <- c("A","B","C","A","D","D","C","D","B","F","A","B","C","C",
      "A","B","D","D","C","C","F")
df <- data.frame(ID,var1)
    ID var1
1   1    A
2   1    B
3   1    C
4   1    A
5   1    D
6   2    D
7   2    C
8   2    D
9   2    B
10  2    F
11  3    A
12  3    B
13  3    C
14  3    C
15  4    A
16  4    B
17  5    D
18  5    D
19  5    C
20  5    C
21  5    F

My output will be

    ID var1 var2    var3
1   1    A  Yes 1-2
2   1    B  Yes 1-2
3   1    C  Yes 1-2
4   1    A  Yes 1-2
5   1    D  Yes 1-2
6   2    D  Yes 1-2
7   2    C  Yes 1-2
8   2    D  Yes 1-2
9   2    B  Yes 1-2
10  2    F  Yes 1-2
11  3    A  Yes 1-3
12  3    B  Yes 1-3
13  3    C  Yes 1-3
14  3    C  Yes 1-3
15  4    A   No       4
16  4    B   No       4
17  5    D  Yes     2-5
18  5    D  Yes     2-5
19  5    C  Yes     2-5
20  5    C  Yes     2-5
21  5    F  Yes     2-5

Thanks for advance.

Weihuang Wong · Accepted Answer

The problem is essentially one of constructing an adjacency table based on common memberships, e.g. Working with Bipartite/Affiliation Network Data in R. To do that, we make a table out of the data (after eliminating duplicates), and then take the cross-product.

dd <- unique(df)
tab <- table(dd)
dd <- crossprod(t(tab))
diag(dd) <- 0
#    ID
# ID  1 2 3 4 5
#   1 0 3 3 2 2
#   2 3 0 2 1 3
#   3 3 2 0 2 1
#   4 2 1 2 0 0
#   5 2 3 1 0 0

The table above allows us to see the number of categories that IDs share. Now we just have to go through the rows; for each row, I select the first ID that has a value of at least 3 (matched).

matched <- apply(dd >= 3, MAR = 1, function(x) which(x == TRUE)[1])   
#  1  2  3  4  5 
#  2  1  1 NA  2

So "1" matched with "2", "2" matched with "1", "3" matched with "1", "4" has no matches, "5" matched with "2". Finish off by manipulating this output to get the desired final product:

out <- apply(cbind(as.numeric(names(matched)), matched), MAR = 1, function(x) {
  if (any(is.na(x))) {
    data.frame(var2 = "No", var3 = x[1])
  } else {
    data.frame(var2 = "Yes", var3 = paste(sort(x), collapse = "-"))
  }
})
out <- plyr::ldply(out, .id = "ID")

merge(df, out, all.x = TRUE)
#    ID var1 var2 var3
# 1   1    A  Yes  1-2
# 2   1    B  Yes  1-2
# 3   1    C  Yes  1-2
# 4   1    A  Yes  1-2
# 5   1    D  Yes  1-2
# 6   2    D  Yes  1-2
# 7   2    C  Yes  1-2
# 8   2    D  Yes  1-2
# 9   2    B  Yes  1-2
# 10  2    F  Yes  1-2
# 11  3    A  Yes  1-3
# 12  3    B  Yes  1-3
# 13  3    C  Yes  1-3
# 14  3    C  Yes  1-3
# 15  4    A   No    4
# 16  4    B   No    4
# 17  5    D  Yes  2-5
# 18  5    D  Yes  2-5
# 19  5    C  Yes  2-5
# 20  5    C  Yes  2-5
# 21  5    F  Yes  2-5

Find pattern of combination by group in R

Answers (1)

Related Questions