Quanteda dfm_lookup using dictionaries with multi-word patterns/expressions

Question

I am using a dictionary to identify usage of a particular set of words in a corpus. The dictionary includes multi-word patterns, but dfm_lookup (from the quanteda package) does not seem to match multi-word expressions. Is there a way to do the same thing as dfm_lookup with a dictionary that contains multi-word expressions?

library(quanteda)

BritainEN <- 
  dictionary(list(identity=c("British", "Great Britain")))


British <- dfm_lookup(debate_dfm,
                      BritainEN, case_insensitive = TRUE)

Ken Benoit · Accepted Answer

Yes: you need to use tokens_lookup() on the tokens before you form the dfm. A dfm records only how often each individual token occurs, so the words no longer exist as the ordered sequence you need to match the multi-word values in your dictionary. So 1) form the tokens object, 2) use tokens_lookup() to apply the dictionary to the tokens, and then 3) form the dfm.

library("quanteda")
#> Package version: 1.5.2

BritainEN <- 
    dictionary(list(identity = c("British", "Great Britain")))

txt <- c(doc1 = "Great Britain is a country.",
         doc2 = "British citizens live in Great Britain.")

tokens(txt) %>%
    tokens_lookup(dictionary = BritainEN, exclusive = FALSE)
#> tokens from 2 documents.
#> doc1 :
#> [1] "IDENTITY" "is"       "a"        "country"  "."       
#> 
#> doc2 :
#> [1] "IDENTITY" "citizens" "live"     "in"       "IDENTITY" "."

tokens(txt) %>%
    tokens_lookup(dictionary = BritainEN) %>%
    dfm()
#> Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
#> 2 x 1 sparse Matrix of class "dfm"
#>       features
#> docs   identity
#>   doc1        1
#>   doc2        2
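
For contrast, here is a minimal sketch of what happens when the dictionary is applied only after the dfm has been formed. It reuses the same text and dictionary, and it assumes a recent quanteda release in which dfm() is called on a tokens object; the names dfm_late and dfm_early are only illustrative.

```r
library("quanteda")

BritainEN <- dictionary(list(identity = c("British", "Great Britain")))
txt <- c(doc1 = "Great Britain is a country.",
         doc2 = "British citizens live in Great Britain.")
toks <- tokens(txt)

# Lookup after forming the dfm: every feature is a single token, so the
# two-word value "Great Britain" has nothing to match against.
dfm_late <- dfm_lookup(dfm(toks), dictionary = BritainEN)

# Lookup on the tokens first: word order is still available, so the
# multi-word value can be matched before counting.
dfm_early <- dfm(tokens_lookup(toks, dictionary = BritainEN))

as.matrix(dfm_late)
as.matrix(dfm_early)
```

With the late lookup, only "British" in doc2 should be counted, whereas the early lookup should reproduce the counts of 1 and 2 shown above.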

Added

To answer the follow-up question in the comments, and to extend @phiver's very useful answer: there is also a nested_scope argument, designed for patterns that also occur inside a multi-word expression (MWE) belonging to another dictionary key.

Example:

library("quanteda")
## Package version: 1.5.2

Ireland_nested <- dictionary(list(
  ie_alone = "Ireland",
  ie_nested = "Northern Ireland"
))

txt <- c(
  doc1 = "Northern Ireland is a country.",
  doc2 = "Some citizens of Ireland live in Northern Ireland."
)

toks <- tokens(txt)

tokens_lookup(toks, dictionary = Ireland_nested, exclusive = FALSE)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "IE_ALONE"  "is"        "a"         "country"   "."        
## 
## doc2 :
## [1] "Some"      "citizens"  "of"        "IE_ALONE"  "live"      "in"       
## [7] "IE_NESTED" "IE_ALONE"  "."
tokens_lookup(toks,
  dictionary = Ireland_nested, nested_scope = "dictionary",
  exclusive = FALSE
)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "is"        "a"         "country"   "."        
## 
## doc2 :
## [1] "Some"      "citizens"  "of"        "IE_ALONE"  "live"      "in"       
## [7] "IE_NESTED" "."

The first call matches both keys: by default, nested matches are resolved only within a key, and here the nested patterns belong to two different keys. (In @phiver's answer the patterns were nested within a single key; in my example they are not.) With nested_scope = "dictionary", nested pattern matches are resolved across the entire dictionary rather than just within each key, so the "Ireland" inside "Northern Ireland" is not counted a second time in my example.
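
To see what the two settings mean for the final counts, here is a small sketch that carries both lookups through to a dfm, following the tokens-then-dfm order from the first part of this answer. The names counts_key and counts_dict are only illustrative, and the sketch assumes a quanteda version in which tokens_lookup() has the nested_scope argument.

```r
library("quanteda")

Ireland_nested <- dictionary(list(
  ie_alone = "Ireland",
  ie_nested = "Northern Ireland"
))

txt <- c(
  doc1 = "Northern Ireland is a country.",
  doc2 = "Some citizens of Ireland live in Northern Ireland."
)

toks <- tokens(txt)

# Default scope ("key"): nesting is only resolved within each key, so the
# "Ireland" inside "Northern Ireland" is also counted under ie_alone.
counts_key <- as.matrix(dfm(tokens_lookup(toks, dictionary = Ireland_nested)))

# Dictionary-wide scope: the longer match takes precedence across keys.
counts_dict <- as.matrix(dfm(tokens_lookup(toks,
  dictionary = Ireland_nested, nested_scope = "dictionary"
)))

counts_key
counts_dict
```

Going by the token output above, ie_alone should drop from 1 to 0 in doc1 and from 2 to 1 in doc2 under nested_scope = "dictionary", while ie_nested stays at 1 in both documents.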

Which you choose depends on your purpose. We designed quanteda to have the defaults that most users would want and expect, but added additional options like this for those with specific needs. (And usually those needs are first expressed by Kohei or me working on a specific use case of our own!)
