Reputation: 27
I have a text data frame of 792 agreements, which I have pre-processed and converted into a dfm. I am experimenting with similarity scores and decided to compute both Jaccard and cosine similarity.
With cosine similarity, the computation takes about half a minute and I get the results. But for the past two days, whenever I do the same with Jaccard, my computer starts whirring and R terminates. Is there something I'm missing here? Does the Jaccard method no longer work?
My code is below.
library(quanteda)
library(tidyr)
# Compute the cosine similarity matrix between documents
s1 <- textstat_simil(trimmed_dfm, method = "cosine", margin = "documents")
# Convert the output into a data frame (it first needs to be converted to a matrix)
cosine_simil_df <- as.data.frame(as.matrix(s1))
# Create a column with the row names of the matrix
cosine_simil_df$PTA1 <- row.names(cosine_simil_df)
# Use pivot_longer() to reshape the data into tidy (long) format
cosine_simil_df_final <- pivot_longer(cosine_simil_df, cols = -PTA1, names_to = "PTA2", values_to = "similarity")
head(cosine_simil_df_final)
# Now try the same with Jaccard similarity
s2 <- textstat_simil(trimmed_dfm, method = "jaccard", margin = "documents")
# This is the line where it all goes wrong
jaccard_simil_df <- as.data.frame(as.matrix(s2))
jaccard_simil_df$PTA1 <- row.names(jaccard_simil_df)
Upvotes: 0
Views: 213
Reputation: 890
I did not optimize the function for Jaccard similarity as much as for cosine similarity. You could try drop0 = TRUE to reduce memory usage. proxyC is the package behind textstat_simil(), so the example uses proxyC::simil():
proxyC::simil(matrix(c(1, 0, 0, 1), nrow = 2),
              matrix(c(2, 2, 0, 0), nrow = 2), method = "jaccard")
#> 2 x 2 sparse Matrix of class "dgTMatrix"
#>
#> [1,] 1 1
#> [2,] 0 0
proxyC::simil(matrix(c(1, 0, 0, 1), nrow = 2),
              matrix(c(2, 2, 0, 0), nrow = 2), method = "jaccard", drop0 = TRUE)
#> 2 x 2 sparse Matrix of class "dgTMatrix"
#>
#> [1,] 1 1
#> [2,] . .
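A minimal sketch of how drop0 = TRUE might be combined with a sparse-to-long conversion, so that the dense as.matrix() step from the question is skipped. The toy matrix and the names toy, sim, tm, and long are made up for illustration. It assumes a proxyC version that accepts Matrix objects and keeps their dimnames, and it passes the same matrix as both arguments, as in the example above. I have not checked whether this avoids the crash on the real dfm.

```r
library(Matrix)
library(proxyC)

# Hypothetical stand-in for trimmed_dfm: 3 documents (rows) x 4 features (columns)
toy <- Matrix::sparseMatrix(
  i = c(1, 1, 2, 2, 3),
  j = c(1, 2, 1, 2, 4),
  x = c(2, 1, 1, 3, 5),
  dims = c(3, 4),
  dimnames = list(c("doc1", "doc2", "doc3"), c("a", "b", "c", "d"))
)

# Jaccard similarity between rows; drop0 = TRUE drops zero similarities
# from the sparse result instead of storing them explicitly
sim <- proxyC::simil(toy, toy, margin = 1, method = "jaccard", drop0 = TRUE)

# Read the non-zero entries as (row, column, value) triplets.
# The @i and @j slots of a TsparseMatrix are 0-based, hence the + 1.
tm <- as(sim, "TsparseMatrix")
long <- data.frame(
  PTA1 = rownames(sim)[tm@i + 1],
  PTA2 = colnames(sim)[tm@j + 1],
  similarity = tm@x
)
head(long)
```

Pairs with zero similarity are simply absent from long. If they are needed, they could be filled back in afterwards, for example with tidyr::complete().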
Upvotes: 1