Guilherme Parreira

Reputation: 1021

Which methods can I use to calculate correlation among words in quanteda?

My question is a continuation of this.

After cleaning my text data and visualizing it using a wordcloud, I want to see which words are correlated with each other. Here is the problem:

  1. quanteda has the function textstat_simil, but it says similarity. So, are "similarity" and "correlation" the same thing in this case? (Is distance also related?)

  2. Moreover, my dfm looks like a binary matrix. In this case, is the phi correlation (from the chi-squared statistic) more appropriate? Can I calculate this via quanteda?

  3. Do you guys have any other content, besides the source code on GitHub, that explains in more detail the methods used to calculate the similarity or distance measures? (I couldn't understand it from the code, sorry.)

Thanks for your patience!

Upvotes: 0

Views: 402

Answers (1)

Ken Benoit

Reputation: 14902

To compute Pearson’s product-moment correlations among features, you would use:

textstat_simil(x, method = "correlation", margin = "features")

The documentation makes this pretty clear, and the correlation method is the default.
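As a minimal sketch of that call on a toy dfm: the four documents and the object names txt, x and sim below are made up for illustration, and the library(quanteda.textstats) line assumes quanteda 3.0 or later, where the textstat_* functions live in that separate package.

```r
library(quanteda)
# Assumption: quanteda >= 3.0, where textstat_simil() is in quanteda.textstats.
# On older versions, drop this line; quanteda itself exports the function.
library(quanteda.textstats)

# Hypothetical toy corpus, only for illustration
txt <- c(d1 = "apple apple banana",
         d2 = "banana cherry cherry",
         d3 = "apple apple apple banana banana",
         d4 = "apple cherry")

# Build a document-feature matrix of word counts
x <- dfm(tokens(txt))

# Pearson correlations between features (columns) rather than documents
sim <- textstat_simil(x, method = "correlation", margin = "features")

# Convert to an ordinary feature-by-feature matrix for inspection
sim_mat <- as.matrix(sim)
print(round(sim_mat, 2))
```

With margin = "documents" instead, the same call would compare documents rather than words.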

Pearson’s correlation would not be the most appropriate for binary data, and we currently do not implement Spearman’s or other correlation methods more appropriate for categorical or ordinal data. However, you can always coerce the dfm to an ordinary matrix (use as.matrix()) and then use stats::cor(), whose methods include Spearman’s.
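A rough sketch of the coercion route, using only base R: the matrix m below is a made-up stand-in for what as.matrix(x) would return for a small dfm (documents in rows, features in columns), so the counts and names are assumptions for illustration.

```r
# Toy count matrix standing in for as.matrix(x) on a real dfm;
# rows are documents, columns are features (hypothetical values).
m <- matrix(
  c(2, 1, 0,
    0, 1, 2,
    3, 2, 0,
    1, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(docs = paste0("d", 1:4),
                  features = c("apple", "banana", "cherry"))
)

# cor() correlates columns, so the result is feature-by-feature.
spearman <- stats::cor(m, method = "spearman")
kendall  <- stats::cor(m, method = "kendall")

print(round(spearman, 2))
```

Keep in mind that as.matrix() produces a dense matrix, so for a large, sparse dfm it may be worth trimming to the features of interest before coercing.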

As for the last question, we use the standard implementations of these measures. If you want more clarity on what they mean, I suggest asking on Cross Validated.

Upvotes: 2
