Gustavo Bernardino

Reputation: 1

Quanteda: display the actual difference between texts

I managed to calculate the difference between two texts with the cosine method. With the following:

library("quanteda")
dfmat <- corpus_subset(corpusnew) %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("portuguese")) %>%
    dfm()
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
as.matrix(tstat1)

And I get the following matrix:

       text1 text2 text3 text4 text5 
text1 1.000 0.801 0.801 0.801 0.798 

However, I would like to know the actual words that account for the difference, not just how much the texts differ or are alike. Is there a way?

Thanks

Upvotes: 0

Views: 215

Answers (2)

Ken Benoit

Reputation: 14902

This question only has pairwise answers, since each computation of similarity occurs between a single pair of documents. It's also not entirely clear what output you want to see, so I'll take my best guess and demonstrate a few possibilities.

So if you wanted to see the features most different between text1 and text2, for instance, you could slice the two documents you want to compare from the dfm, and then set margin = "features" to get the similarity of features across those documents.

library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

dfmat <- tokens(data_corpus_inaugural[1:5], remove_punct = TRUE) %>%
    tokens_remove(stopwords("en")) %>%
    dfm()

library("quanteda.textstats")
sim <- textstat_simil(dfmat[1:2, ], margin = "features", method = "cosine")

Now we can examine the pairwise feature similarities (greatest and smallest) by converting the similarity matrix to a data.frame and sorting it.

# most similar features
as.data.frame(sim) %>%
    dplyr::arrange(desc(cosine)) %>%
    dplyr::filter(cosine < 1) %>%
    head(10)
#>    feature1   feature2    cosine
#> 1   present        may 0.9994801
#> 2   country        may 0.9994801
#> 3       may government 0.9991681
#> 4   present   citizens 0.9988681
#> 5   country   citizens 0.9988681
#> 6   present     people 0.9988681
#> 7   country     people 0.9988681
#> 8   present     united 0.9988681
#> 9   country     united 0.9988681
#> 10  present government 0.9973337
    
# most different features
as.data.frame(sim) %>%
    dplyr::arrange(cosine) %>%
    head(10)
#>      feature1   feature2    cosine
#> 1  government       upon 0.1240347
#> 2  government      chief 0.1240347
#> 3  government magistrate 0.1240347
#> 4  government     proper 0.1240347
#> 5  government     arrive 0.1240347
#> 6  government   endeavor 0.1240347
#> 7  government    express 0.1240347
#> 8  government       high 0.1240347
#> 9  government      sense 0.1240347
#> 10 government  entertain 0.1240347

Created on 2022-03-08 by the reprex package (v2.0.1)

There are other ways to compare the words that differ most between documents, such as "keyness" - for instance quanteda.textstats::textstat_keyness() between text1 and text2, where the head and tail of the resulting data.frame will give you the most dissimilar features.
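A sketch of that keyness approach might look like this (using the built-in inaugural corpus rather than your data, and comparing the first document against the second):

    library("quanteda")
    library("quanteda.textstats")
    
    # build a dfm from the first two inaugural addresses
    dfmat <- tokens(data_corpus_inaugural[1:2], remove_punct = TRUE) %>%
        tokens_remove(stopwords("en")) %>%
        dfm()
    
    # keyness of document 1 (target) versus document 2 (reference)
    tstat_key <- textstat_keyness(dfmat, target = 1)
    
    head(tstat_key)  # features most characteristic of the first document
    tail(tstat_key)  # features most characteristic of the second document

Positive keyness statistics indicate features over-represented in the target document; negative values indicate features over-represented in the reference.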

Upvotes: 0

Kohei Watanabe

Reputation: 890

How about comparing tokens using setdiff()?

require(quanteda)
toks <- tokens(corpus(c("a b c d", "a e")))
toks
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "a" "b" "c" "d"
#> 
#> text2 :
#> [1] "a" "e"

setdiff(toks[[1]], toks[[2]])
#> [1] "b" "c" "d"
setdiff(toks[[2]], toks[[1]])
#> [1] "e"

Upvotes: 0
