trotta

Reputation: 1226

Extract top positive and negative features when applying dictionary in quanteda

I have a data frame with around 100k rows of textual data. Using the quanteda package, I apply sentiment analysis (with the Lexicoder dictionary) to eventually calculate a sentiment score. As an additional, more qualitative step of analysis, I would like to extract the top features (i.e. the negative/positive words from the dictionary that occur most frequently in my data) to examine whether the discourse is driven by particular words.

my_corpus <- corpus(my_df, docid_field = "ID", text_field = "my_text", metacorpus = NULL, compress = FALSE)
sentiment_corp <- dfm(my_corpus, dictionary = data_dictionary_LSD2015)

However, going through the quanteda documentation, I couldn't figure out how to achieve this - is there a way? I'm aware of topfeatures and I did read this question, but it didn't help.

Upvotes: 4

Views: 547

Answers (1)

Ken Benoit

Reputation: 14902

In all of the quanteda functions that take a pattern argument, the valid types of patterns are character vectors, lists, and dictionaries. So the best way to assess the top features in each dictionary category (what we also call a dictionary key) is to select tokens using that dictionary and then use topfeatures().
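As a minimal sketch of that idea on toy data, the selection can be repeated for every key in one pass. The texts in txt, the two-key toy_dict, and the name top_by_key are hypothetical and only for illustration; for real use, data_dictionary_LSD2015 and your own texts would take their place.

```r
library("quanteda")

# Hypothetical toy texts standing in for the real text column
txt <- c(d1 = "A fair budget will support recovery and create confidence.",
         d2 = "The crisis and the deficit hurt everyone, but support is coming.")

# Hypothetical two-key dictionary (assumption: not the real LSD2015 entries)
toy_dict <- dictionary(list(
  positive = c("fair", "support", "recover*", "confiden*"),
  negative = c("crisis", "deficit", "hurt")
))

toks <- tokens(txt)

# For each key: keep only tokens matching that key, then count them
top_by_key <- lapply(names(toy_dict), function(key) {
  sel <- tokens_select(toks, pattern = toy_dict[key])
  topfeatures(dfm(sel), n = 5)
})
names(top_by_key) <- names(toy_dict)

top_by_key
```

Subsetting with single brackets (toy_dict[key]) keeps a dictionary object holding just that key, so each element of top_by_key is a named vector of counts for one category.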

Here is how to do this using the built-in data_corpus_irishbudget2010 object, as an example, with the Lexicoder Sentiment Dictionary.

library("quanteda")
## Package version: 1.4.3

# tokenize and select just the dictionary value matches
toks <- tokens(data_corpus_irishbudget2010) %>%
  tokens_select(pattern = data_dictionary_LSD2015)
lapply(toks[1:5], head)
## $`Lenihan, Brian (FF)`
## [1] "severe"        "distress"      "difficulties"  "recovery"     
## [5] "benefit"       "understanding"
## 
## $`Bruton, Richard (FG)`
## [1] "failed"   "warnings" "sucking"  "losses"   "debt"     "hurt"    
## 
## $`Burton, Joan (LAB)`
## [1] "remarkable" "consensus"  "Ireland"    "opposition" "knife"     
## [6] "dispute"   
## 
## $`Morgan, Arthur (SF)`
## [1] "worst"     "worst"     "well"      "corrupt"   "golden"    "protected"
## 
## $`Cowen, Brian (FF)`
## [1] "challenge"      "succeeding"     "challenge"      "oppose"        
## [5] "responsibility" "support"

To explore the top matches for the positive key, we can narrow the selection further by subsetting the dictionary to just that key.

# top positive matches
tokens_select(toks, pattern = data_dictionary_LSD2015["positive"]) %>%
  dfm() %>%
  topfeatures()
##    benefit    support   recovery       fair     create confidence 
##         68         52         44         41         39         37 
##    provide       well     credit       help 
##         36         33         31         29

And for Negative:

# top negative matches
tokens_select(toks, pattern = data_dictionary_LSD2015["negative"]) %>%
  dfm() %>%
  topfeatures()
##    ireland    benefit        not    support     crisis   recovery 
##         79         68         52         52         47         44 
##       fair     create    deficit confidence 
##         41         39         38         37

Why is "Ireland" a negative match? Because the LSD2015 includes ir* as a negative pattern, intended to match ire and ireful. With the default case-insensitive matching, it also matches Ireland (a term used frequently in this example corpus). This is an example of a "false positive" match, always a risk with dictionaries that use wildcards, or with a language such as English that has a very high rate of polysemes and homographs.
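The effect can be reproduced outside quanteda. The sketch below emulates the glob pattern with base R's glob2rx() and grepl(); it is an illustration of glob-style matching, not quanteda's internal matcher, and the words vector is made up for the example.

```r
# Emulate the glob dictionary entry "ir*" with base R (not quanteda's own code)
pattern <- "ir*"
words <- c("ire", "ireful", "Ireland", "fire")  # hypothetical test words

# Convert the glob to an anchored regular expression
rx <- glob2rx(pattern)

# Case-insensitive matching, analogous to the default behaviour described above
insensitive <- grepl(rx, words, ignore.case = TRUE)

# Case-sensitive matching for comparison
sensitive <- grepl(rx, words, ignore.case = FALSE)

data.frame(word = words, insensitive = insensitive, sensitive = sensitive)
```

Case-sensitive matching drops "Ireland" here, but it would also miss a capitalised sentence-initial "Ire", so it is a trade-off rather than a clean fix.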

Upvotes: 3