Regan Odongo
Regan Odongo

Reputation: 3

Pairwise comparison of dataframe row elements

I want to find the number of all common elements in rows of a dataframe

name           members
x1            A,B,N,K,Y,G
x2            J,L,M,N,T
x3            G,H,S,J,D,F
x4            J,K,H,F,H,D,L

name         common       name
x1                   6               x1
x1                   2               x2
x1                     -             x3
x1                      -            x4
x2                       -           x1
x2                5        -          x2
x2                         -         x3
x2                          -        x4
x3                           -       x1
x3                            -      x2
x3                   6          -     x3
x3                              -    x4
x4                               -   x1
x4                                -  x2
x4                                 - x3
x4                  7                -x4

Upvotes: 0

Views: 699

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269501

1) dplyr/tidyr For each row, create a separate row for each member using separate_rows and join that to itself by members. Then compute counts and complete it.

library(dplyr)
library(tidyr)

DF %>%
  separate_rows(members) %>%
  distinct %>%
  inner_join(., ., by = "members") %>%
  count(name.x, name.y) %>%
  complete(name.x, name.y)

giving:

# A tibble: 16 x 3
   name.x name.y     n
   <chr>  <chr>  <int>
 1 x1     x1         6
 2 x1     x2         1
 3 x1     x3         1
 4 x1     x4         1
 5 x2     x1         1
 6 x2     x2         5
 7 x2     x3         1
 8 x2     x4         2
 9 x3     x1         1
10 x3     x2         1
11 x3     x3         6
12 x3     x4         4
13 x4     x1         1
14 x4     x2         2
15 x4     x3         4
16 x4     x4         6

2) Base R Create a function which computes the number of intersections between two member components. Then use outer to apply it to each pair and convert to a data.frame.

Scan <- function(x) scan(text = x, what = "", sep = ",", quiet = TRUE)
countSame <- function(x, y) length(intersect(Scan(x), Scan(y)))
x <- setNames(DF$members, DF$name)
as.data.frame.table(outer(x, x, Vectorize(countSame)))

giving:

   Var1 Var2 Freq
1    x1   x1    6
2    x2   x1    1
3    x3   x1    1
4    x4   x1    1
5    x1   x2    1
6    x2   x2    5
7    x3   x2    1
8    x4   x2    2
9    x1   x3    1
10   x2   x3    1
11   x3   x3    6
12   x4   x3    4
13   x1   x4    1
14   x2   x4    2
15   x3   x4    4
16   x4   x4    6

Although the question asked for the data.frame form above you might prefer a 2d table which can be produced by just omitting as.data.frame.table from the last line of code.

   x1 x2 x3 x4
x1  6  1  1  1
x2  1  5  1  2
x3  1  1  6  4
x4  1  2  4  6

2a) A variation on (2) which is only two lines long can be formed by applying strsplit to the members and then computing the length of the intersection of pairs using outer. Finally we convert to a data frame. A 2d table could again be formed by omitting as.data.frame.table.)

x <- with(DF, setNames(strsplit(members, ","), name))
as.data.frame.table(outer(x, x, Vectorize(function(x, y) length(intersect(x, y)))))

giving:

   Var1 Var2 Freq
1    x1   x1    6
2    x2   x1    1
3    x3   x1    1
4    x4   x1    1
5    x1   x2    1
6    x2   x2    5
7    x3   x2    1
8    x4   x2    2
9    x1   x3    1
10   x2   x3    1
11   x3   x3    6
12   x4   x3    4
13   x1   x4    1
14   x2   x4    2
15   x3   x4    4
16   x4   x4    6

Note

Lines <- "name           members
x1            A,B,N,K,Y,G
x2            J,L,M,N,T
x3            G,H,S,J,D,F
x4            J,K,H,F,H,D,L"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)

Update

In (2a) with R 4.3 as.data.frame.table could be replaced with array2DF .

Upvotes: 2

Rui Barradas
Rui Barradas

Reputation: 76402

I believe the following code does what the question asks for. But note that I find it complicated, with two merge instructions, maybe someone else will find a simpler solution.

fun <- function(DF){
  ex <- expand.grid(Var2 = DF[['name']], name = DF[['name']])[2:1]
  members <- as.character(DF[['members']])
  merge(DF, ex)
}

tmp <- merge(df1, fun(df1))
o <- order(tmp[[3]])
tmp$members2 <- tmp$members[o]

tmp$common <- apply(tmp[c(2, 4)], 1, function(x){
  y1 <- unlist(strsplit(as.character(x[1]), ","))
  y2 <- unlist(strsplit(as.character(x[2]), ","))
  length(intersect(y1, y2))
})

res <- tmp[c(1, 5, 3)]
names(res)[3] <- "name2"

head(res)
#  name common name2
#1   x1      6    x1
#2   x1      1    x2
#3   x1      1    x3
#4   x1      1    x4
#5   x2      1    x1
#6   x2      5    x2

Final clean up.

rm(tmp)

Data.

df1 <- read.table(text = "
name           members
x1            A,B,N,K,Y,G
x2            J,L,M,N,T
x3            G,H,S,J,D,F
x4            J,K,H,F,H,D,L
", header = TRUE)

Upvotes: 0

Related Questions