I managed to calculate the cosine similarity between texts with the following:
library("quanteda")
library("quanteda.textstats")  # textstat_simil() lives here in quanteda >= 3.0

dfmat <- corpus_subset(corpusnew) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("portuguese")) %>%
  dfm()
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
as.matrix(tstat1)
And I get the following matrix:
text1 text2 text3 text4 text5
text1 1.000 0.801 0.801 0.801 0.798
However, I would like to know the actual words that account for the difference, not just how much the texts differ or are alike. Is there a way?
Thanks
This question only has pairwise answers, since each computation of similarity occurs between a single pair of documents. It's also not entirely clear what output you want to see, so I'll take my best guess and demonstrate a few possibilities.
So if you wanted to see the features that differ most between text1 and text2, for instance, you could slice the two documents you want to compare from the dfm, and then set margin = "features" to get the similarity of each pair of features across those documents.
library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
dfmat <- tokens(data_corpus_inaugural[1:5], remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()
library("quanteda.textstats")
sim <- textstat_simil(dfmat[1:2, ], margin = "features", method = "cosine")
Now we can examine the pairwise similarities (greatest and smallest) by converting the similarity object to a data.frame and sorting it.
# most similar features
as.data.frame(sim) %>%
  dplyr::arrange(desc(cosine)) %>%
  dplyr::filter(cosine < 1) %>%
  head(10)
#> feature1 feature2 cosine
#> 1 present may 0.9994801
#> 2 country may 0.9994801
#> 3 may government 0.9991681
#> 4 present citizens 0.9988681
#> 5 country citizens 0.9988681
#> 6 present people 0.9988681
#> 7 country people 0.9988681
#> 8 present united 0.9988681
#> 9 country united 0.9988681
#> 10 present government 0.9973337
# most different features
as.data.frame(sim) %>%
  dplyr::arrange(cosine) %>%
  head(10)
#> feature1 feature2 cosine
#> 1 government upon 0.1240347
#> 2 government chief 0.1240347
#> 3 government magistrate 0.1240347
#> 4 government proper 0.1240347
#> 5 government arrive 0.1240347
#> 6 government endeavor 0.1240347
#> 7 government express 0.1240347
#> 8 government high 0.1240347
#> 9 government sense 0.1240347
#> 10 government entertain 0.1240347
Created on 2022-03-08 by the reprex package (v2.0.1)
There are other ways to find the words that most distinguish two documents, such as "keyness" - for instance, quanteda.textstats::textstat_keyness() applied to text1 and text2, where the head and tail of the resulting data.frame will show you the features most strongly associated with each document.
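A minimal sketch of that keyness approach, reusing the first two inaugural addresses from the dfm built earlier (target = 1 designates text1 as the target document, so this is an assumption about which pair you want to compare):

```r
library("quanteda")
library("quanteda.textstats")

# build a dfm from just the two documents being compared
dfmat2 <- tokens(data_corpus_inaugural[1:2], remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()

# keyness of text1 relative to text2: positive scores mark features
# distinctive of the target document, negative scores of the reference
key <- textstat_keyness(dfmat2, target = 1)

head(key)  # features most strongly associated with text1
tail(key)  # features most strongly associated with text2
```

The default measure is chi2, and the result comes back sorted from most target-associated to most reference-associated, which is why head() and tail() bracket the most dissimilar features.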
How about comparing tokens using setdiff()?
require(quanteda)
toks <- tokens(corpus(c("a b c d", "a e")))
toks
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "a" "b" "c" "d"
#>
#> text2 :
#> [1] "a" "e"
setdiff(toks[[1]], toks[[2]])
#> [1] "b" "c" "d"
setdiff(toks[[2]], toks[[1]])
#> [1] "e"