Calculate category co-occurrences in R (tidyverse)

Question

I have the following data frame:

| Document | CatA | CatB | CatC | CatD |
|----------|------|------|------|------|
| A        | 1    | 0    | 1    | 1    |
| B        | 0    | 1    | 1    | 0    |
| C        | 1    | 1    | 0    | 1    |

indicating which categories occur in each document; for example, CatA, CatC, and CatD co-occur in document A.
I need to calculate the category co-occurrence matrix over all documents, for example as follows:

|      | CatA | CatB | CatC | CatD |
|------|------|------|------|------|
| CatA | NA   | 1    | 1    | 2    |
| CatB | 1    | NA   | 1    | 1    |
| CatC | 1    | 1    | NA   | 1    |
| CatD | 2    | 1    | 1    | NA   |

ekstroem · Accepted Answer

If your data frame contains only zeros and ones, you can compute the co-occurrence matrix directly in base R with the crossprod() function:

# one column per category (CatA to CatD), one row per document (A to C)
x <- cbind(c(1, 0, 1), c(0, 1, 1), c(1, 1, 0), c(1, 0, 1))
crossprod(x)

which produces

     [,1] [,2] [,3] [,4]
[1,]    2    1    1    2
[2,]    1    2    1    1
[3,]    1    1    2    1
[4,]    2    1    1    2
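For intuition, crossprod(x) is equivalent to t(x) %*% x: entry (i, j) sums x[d, i] * x[d, j] over the documents d, so with 0/1 data it counts the documents that contain both categories. The diagonal therefore holds the number of documents in which each category appears, not a true co-occurrence. A minimal check of both statements, reusing the same x (the name `manual` is only for illustration):

```r
x <- cbind(c(1, 0, 1), c(0, 1, 1), c(1, 1, 0), c(1, 0, 1))

# crossprod(x) computes the same product as t(x) %*% x
manual <- t(x) %*% x
all.equal(crossprod(x), manual)

# the diagonal equals the per-category document counts
all.equal(diag(crossprod(x)), colSums(x))
```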

The diagonal can then be set to NA with:

res <- crossprod(x)
diag(res) <- NA
res

     [,1] [,2] [,3] [,4]
[1,]   NA    1    1    2
[2,]    1   NA    1    1
[3,]    1    1   NA    1
[4,]    2    1    1   NA
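To get the category names as row and column labels, as in the table from the question, the same idea can be applied to the data frame itself. This is a sketch under the assumption that the data frame is called `df` and that its first column is the document identifier; both the name `df` and the name `cooc` are illustrative and not from the original post.

```r
# Example data copied from the question
df <- data.frame(
  Document = c("A", "B", "C"),
  CatA = c(1, 0, 1),
  CatB = c(0, 1, 1),
  CatC = c(1, 1, 0),
  CatD = c(1, 0, 1)
)

# Drop the identifier column and convert the 0/1 columns to a numeric matrix
m <- as.matrix(df[, -1])

# crossprod() carries the column names over to both dimensions of the result
cooc <- crossprod(m)
diag(cooc) <- NA
cooc
```

Because the result keeps the column names, individual counts can be looked up by name, for example cooc["CatA", "CatD"].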
