Reputation: 1226
I have a data frame with around 100k rows that contain textual data. Using the quanteda package, I apply sentiment analysis (Lexicoder dictionary) to eventually calculate a sentiment score. For an additional, more qualitative step of analysis, I would like to extract the top features (i.e. the negative/positive words from the dictionary that occur most frequently in my data) to examine whether the discourse is driven by particular words.
my_corpus <- corpus(my_df, docid_field = "ID", text_field = "my_text", metacorpus = NULL, compress = FALSE)
sentiment_corp <- dfm(my_corpus, dictionary = data_dictionary_LSD2015)
However, going through the quanteda documentation, I couldn't figure out how to achieve this - is there a way?
I'm aware of topfeatures and I did read this question, but it didn't help.
Upvotes: 4
Views: 547
Reputation: 14902
In all of the quanteda functions that take a pattern argument, the valid types of patterns are character vectors, lists, and dictionaries. So the best way to assess the top features in each dictionary category (what we also call a dictionary key) is to select on that dictionary and then use topfeatures().
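As a minimal sketch of that idea on toy data, the snippet below loops over the keys of a small dictionary and collects the top features per key. The texts in txt and the two-key toy_dict are made up for illustration and are not part of the LSD2015; only the select-then-topfeatures() pattern carries over.

```r
library("quanteda")

# Hypothetical toy texts and dictionary, for illustration only
txt <- c(d1 = "good growth but bad debt and bad losses",
         d2 = "good support despite severe debt")
toy_dict <- dictionary(list(positive = c("good", "support", "grow*"),
                            negative = c("bad", "debt", "loss*", "severe")))

toks <- tokens(txt)

# For each dictionary key: keep only the tokens matching that key,
# build a dfm, and take the most frequent features
top_by_key <- lapply(names(toy_dict), function(key) {
  topfeatures(dfm(tokens_select(toks, pattern = toy_dict[key])))
})
names(top_by_key) <- names(toy_dict)

top_by_key
```

Subsetting with single brackets (toy_dict[key]) keeps the result a dictionary, so it can be passed straight to the pattern argument.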
Here is how to do this using the built-in data_corpus_irishbudget2010 object, as an example, with the Lexicoder Sentiment Dictionary.
library("quanteda")
## Package version: 1.4.3
# tokenize and select just the dictionary value matches
toks <- tokens(data_corpus_irishbudget2010) %>%
  tokens_select(pattern = data_dictionary_LSD2015)
lapply(toks[1:5], head)
## $`Lenihan, Brian (FF)`
## [1] "severe" "distress" "difficulties" "recovery"
## [5] "benefit" "understanding"
##
## $`Bruton, Richard (FG)`
## [1] "failed" "warnings" "sucking" "losses" "debt" "hurt"
##
## $`Burton, Joan (LAB)`
## [1] "remarkable" "consensus" "Ireland" "opposition" "knife"
## [6] "dispute"
##
## $`Morgan, Arthur (SF)`
## [1] "worst" "worst" "well" "corrupt" "golden" "protected"
##
## $`Cowen, Brian (FF)`
## [1] "challenge" "succeeding" "challenge" "oppose"
## [5] "responsibility" "support"
To explore the top matches for the positive entry, we can select them further by subsetting the dictionary for the Positive key.
# top positive matches
tokens_select(toks, pattern = data_dictionary_LSD2015["positive"]) %>%
  dfm() %>%
  topfeatures()
##    benefit    support   recovery       fair     create confidence
##         68         52         44         41         39         37
##    provide       well     credit       help
##         36         33         31         29
And for Negative:
# top negative matches
tokens_select(toks, pattern = data_dictionary_LSD2015[["negative"]]) %>%
  dfm() %>%
  topfeatures()
##    ireland    benefit        not    support     crisis   recovery
##         79         68         52         52         47         44
##       fair     create    deficit confidence
##         41         39         38         37
Why is "Ireland" a negative match? Because the LSD2015 includes ir* as a negative word that is intended to match ire and ireful, but with the default case-insensitive matching it also matches Ireland (a term frequently used in this example corpus). This is an example of a "false positive" match, always a risk in dictionaries when using wildcarding or when using a language such as English that has a very high rate of polysemes and homographs.
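To make the false positive concrete, here is a small sketch on a made-up sentence. It shows two possible workarounds, neither of which is the only way to handle this: matching case-sensitively, or removing the known false positive after selection. The example text is an assumption for illustration; the pattern ir* is the one discussed above.

```r
library("quanteda")

# Made-up sentence containing both a true match and the false positive
toks <- tokens("Ireland faces the ire of voters, and Ireland will respond.")

# Default: case-insensitive glob matching, so "ir*" also captures "Ireland"
default_match <- tokens_select(toks, pattern = "ir*")

# Option 1: case-sensitive matching, so the lower-case pattern skips "Ireland"
sensitive_match <- tokens_select(toks, pattern = "ir*", case_insensitive = FALSE)

# Option 2: keep the default matching, then drop the known false positive
cleaned_match <- tokens_remove(default_match, pattern = "ireland")

as.list(default_match)
as.list(sensitive_match)
as.list(cleaned_match)
```

The same case_insensitive = FALSE argument can be passed when selecting on a dictionary key, though case-sensitive matching would also miss genuine matches that happen to be capitalised, for example at the start of a sentence.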
Upvotes: 3