Reputation: 3
I have a dataframe that consists of x rows and n columns. Each row represents a document and each column represents a category of tag associated with the document. The values in each cell are a 0 (meaning the tag isn't associated with the document) or a 1 (meaning the tag is associated with the document). My goal is to collapse this x * n dataframe into an n * n summed adjacency(?) matrix in order to visualize how often categories of tags hang together across all documents.
I apologize in advance: I am new to this type of analysis, so I may not be using the correct terminology here.
For example, I might currently have the following dataframe:
data_have <- data.frame(Docname = c('ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'),
                        Cat1 = c(0,0,1,0,1),
                        Cat2 = c(0,0,0,0,1),
                        Cat3 = c(1,0,1,0,0),
                        Cat4 = c(1,1,0,0,0),
                        Cat5 = c(0,1,1,1,1))
Docname Cat1 Cat2 Cat3 Cat4 Cat5
ABC1       0    0    1    1    0
ABC2       0    0    0    1    1
ABC3       1    0    1    0    1
ABC4       0    0    0    0    1
ABC5       1    1    0    0    1
Which I would like to transform into a new dataframe that looks like this:
data_want <- data.frame(Tag = c('Cat1', 'Cat2', 'Cat3', 'Cat4', 'Cat5'),
                        Cat1 = c(NA,1,1,0,2),
                        Cat2 = c(1,NA,0,0,1),
                        Cat3 = c(1,0,NA,1,1),
                        Cat4 = c(0,0,1,NA,1),
                        Cat5 = c(2,1,1,1,NA))
Tag  Cat1 Cat2 Cat3 Cat4 Cat5
Cat1   NA    1    1    0    2
Cat2    1   NA    0    0    1
Cat3    1    0   NA    1    1
Cat4    0    0    1   NA    1
Cat5    2    1    1    1   NA
As you can see, this second matrix (data_want) shows the sum of instances in which each tag category occurred together within a document. In this instance, Category 1 occurred with category 5 twice across documents, for example.
Ideally my ultimate goal is to be able to input the final matrix into a social network analysis visualization, so I can visualize how closely connected each tag category is to all other tag categories. But this matrix would be the first step to just look at where the clusters of pairings are most often occurring. Do you have any recommendations for R syntax that would accomplish this?
For context, my actual dataset has about 50 tag categories, so I definitely can't compute this by hand easily.
Attempts:
I tried squarematrix from the miscset package, which resulted in a list object rather than a matrix.
data_want1 = squarematrix(data_have)
Following another Stack Overflow example, I also tried tcrossprod, which didn't produce the desired output:
matrix_want <- as.matrix(data_have[-1])
resultMat <- tcrossprod(matrix_want)
diag(resultMat) <- 0
Resulting matrix (resultMat):
V1 V2 V3 V4 V5
0 1 1 0 0
1 0 1 1 1
1 1 0 1 2
0 1 1 0 1
0 1 2 1 0
Honestly, I am really stumped on how to accomplish this. I don't even know where to begin, and my Google searches don't seem to point to the right type of procedure. I'm sorry not to have more examples of what I tried!
Thanks in advance!
Upvotes: 0
Views: 46
Reputation: 146070
This is a matrix multiplication, t(m) %*% m, with the diagonal set to NA:
m = as.matrix(data_have[-1])
result = t(m) %*% m
diag(result) = NA
result
# Cat1 Cat2 Cat3 Cat4 Cat5
# Cat1 NA 1 1 0 2
# Cat2 1 NA 0 0 1
# Cat3 1 0 NA 1 1
# Cat4 0 0 1 NA 1
# Cat5 2 1 1 1 NA
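As a rough sketch of why the tcrossprod attempt in the question went wrong, and of one possible next step toward the network visualization: crossprod(m) is base R's shorthand for t(m) %*% m (tag by tag), whereas tcrossprod(m) is m %*% t(m) (document by document). The code below rebuilds the example data from the question and also flattens the result into an edge list. The names co_occurrence, shared_tags, and edges (with its from/to/weight columns) are illustrative choices, not anything from the original post.

```r
# Example data copied from the question.
data_have <- data.frame(
  Docname = c("ABC1", "ABC2", "ABC3", "ABC4", "ABC5"),
  Cat1 = c(0, 0, 1, 0, 1),
  Cat2 = c(0, 0, 0, 0, 1),
  Cat3 = c(1, 0, 1, 0, 0),
  Cat4 = c(1, 1, 0, 0, 0),
  Cat5 = c(0, 1, 1, 1, 1)
)

m <- as.matrix(data_have[-1])  # documents in rows, tags in columns

# crossprod(m) computes t(m) %*% m: a tag-by-tag matrix whose cells count
# the documents in which two tags appear together.
co_occurrence <- crossprod(m)
diag(co_occurrence) <- NA

# tcrossprod(m) computes m %*% t(m): a document-by-document matrix whose
# cells count the tags that two documents share. This is what the attempt
# in the question produced.
shared_tags <- tcrossprod(m)

# Flatten the upper triangle into an edge list (one row per tag pair) and
# keep only pairs that co-occur at least once.
# Column 1 of `upper` holds row indices, column 2 holds column indices.
upper <- which(upper.tri(co_occurrence), arr.ind = TRUE)
edges <- data.frame(
  from   = rownames(co_occurrence)[upper[, 1]],
  to     = colnames(co_occurrence)[upper[, 2]],
  weight = co_occurrence[upper]
)
edges <- edges[edges$weight > 0, ]
edges
```

Either form can serve as input to a network package. With igraph, for instance, graph_from_adjacency_matrix() takes the square matrix and graph_from_data_frame() takes an edge list; check the package documentation for the exact arguments, since the diagonal would need to be 0 rather than NA for the matrix route.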
Upvotes: 2