Reputation: 1021
My question is a continuation of this.
After cleaning my text data and visualizing it using a wordcloud, I want to see which words are correlated with each other. Here comes the problem: quanteda has the function textstat_simil, but it talks about similarity. So, are "similarity" and "correlation" the same thing in this case? (Is distance also related?)
Moreover, my dfm looks like a binary matrix. In this case, is the phi correlation (from the chi-squared statistic) more appropriate? Can I calculate it with quanteda?
Thanks for your patience!
Upvotes: 0
Views: 402
Reputation: 14902
To compute Pearson’s product-moment correlations among features, you would use:
textstat_simil(x, method = "correlation", margin = "features")
The documentation makes this pretty clear, and the correlation method is the default.
Pearson’s correlation would not be the most appropriate for binary data, and we currently do not implement Spearman’s or other correlation methods more appropriate for categorical or ordinal data. However, you can always coerce the dfm to an ordinary matrix (use as.matrix()) and then use stats::cor(), whose methods include Spearman’s.
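As a rough sketch of that last step, something like the following should work. To keep it runnable without quanteda, the matrix m is a hand-built stand-in for what as.matrix() would return from a dfm; the document names, feature names, and values are invented for illustration only.

```r
# Stand-in for as.matrix(my_dfm): rows are documents, columns are features.
# All names and values here are hypothetical.
m <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    0, 0, 1,
    1, 0, 1,
    0, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("tax", "budget", "health"))
)

# cor() correlates columns, so this gives feature-by-feature
# Spearman rank correlations.
spearman <- stats::cor(m, method = "spearman")

# Pearson on the same columns, for comparison.
pearson <- stats::cor(m, method = "pearson")

print(round(spearman, 3))
```

Note that a feature occurring in every document or in none has zero variance, so cor() will return NA for it (with a warning). Also, on strictly 0/1 columns the two matrices above should coincide, because ranking a two-valued variable is only a linear rescaling of it; the methods start to differ once the matrix holds counts rather than presence/absence.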
As for the last question, we use the standard implementation of these measures. If you want more clarity on what they mean, I suggest asking on Cross Validated.
Upvotes: 2