MCC89

Reputation: 67

Quanteda dfm_lookup using dictionaries with multi-word patterns/expressions

I am using a dictionary to identify usage of a particular set of words in a corpus. The dictionary includes multi-word patterns, but dfm_lookup (from the quanteda package) does not appear to match multi-word expressions. Does anyone know how to do the equivalent of dfm_lookup with a dictionary that contains multi-word expressions?

library(quanteda)

BritainEN <- 
  dictionary(list(identity=c("British", "Great Britain")))


British <- dfm_lookup(debate_dfm,
                      BritainEN, case_insensitive = TRUE)

Upvotes: 2

Views: 889

Answers (2)

Ken Benoit

Reputation: 14902

Yes, you need to use tokens_lookup() on the tokens before you form the dfm. Once the tokens have been counted into a dfm, they no longer exist as the ordered sequence needed to match the multi-word values in your dictionary. So 1) form the tokens object, 2) use tokens_lookup() to apply the dictionary to the tokens, and then 3) form the dfm.

library("quanteda")
#> Package version: 1.5.2

BritainEN <- 
    dictionary(list(identity = c("British", "Great Britain")))

txt <- c(doc1 = "Great Britain is a country.",
         doc2 = "British citizens live in Great Britain.")

tokens(txt) %>%
    tokens_lookup(dictionary = BritainEN, exclusive = FALSE)
#> tokens from 2 documents.
#> doc1 :
#> [1] "IDENTITY" "is"       "a"        "country"  "."       
#> 
#> doc2 :
#> [1] "IDENTITY" "citizens" "live"     "in"       "IDENTITY" "."

tokens(txt) %>%
    tokens_lookup(dictionary = BritainEN) %>%
    dfm()
#> Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
#> 2 x 1 sparse Matrix of class "dfm"
#>       features
#> docs   identity
#>   doc1        1
#>   doc2        2
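
For contrast, here is a minimal sketch of what happens when the same dictionary is applied with dfm_lookup() after the dfm has already been formed. It reuses the txt and BritainEN objects from above. The object names dfm_first and tokens_first are only illustrative, and the exact counts may depend on your quanteda version.

```r
library("quanteda")

BritainEN <-
    dictionary(list(identity = c("British", "Great Britain")))

txt <- c(doc1 = "Great Britain is a country.",
         doc2 = "British citizens live in Great Britain.")

# dfm first, then lookup: the features are single tokens
# ("great", "britain", ...), so the two-word value has nothing to match
dfm_first <- dfm_lookup(dfm(tokens(txt)), dictionary = BritainEN)

# tokens first, then lookup, then dfm: word order is still available
tokens_first <- dfm(tokens_lookup(tokens(txt), dictionary = BritainEN))

# compare the counts for the "identity" key in each approach
as.matrix(dfm_first)
as.matrix(tokens_first)
```

In this sketch only the single-word value "British" should be picked up by the dfm-first approach, so "Great Britain" goes uncounted there.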

Added

To answer the additional question in the comments, and to expand on @phiver's very useful answer, there is also a nested_scope argument designed for matches that might occur within the value of another multi-word expression (MWE) dictionary key.

Example:

library("quanteda")
## Package version: 1.5.2

Ireland_nested <- dictionary(list(
  ie_alone = "Ireland",
  ie_nested = "Northern Ireland"
))

txt <- c(
  doc1 = "Northern Ireland is a country.",
  doc2 = "Some citizens of Ireland live in Northern Ireland."
)

toks <- tokens(txt)

tokens_lookup(toks, dictionary = Ireland_nested, exclusive = FALSE)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "IE_ALONE"  "is"        "a"         "country"   "."        
## 
## doc2 :
## [1] "Some"      "citizens"  "of"        "IE_ALONE"  "live"      "in"       
## [7] "IE_NESTED" "IE_ALONE"  "."

tokens_lookup(toks,
  dictionary = Ireland_nested, nested_scope = "dictionary",
  exclusive = FALSE
)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "is"        "a"         "country"   "."        
## 
## doc2 :
## [1] "Some"      "citizens"  "of"        "IE_ALONE"  "live"      "in"       
## [7] "IE_NESTED" "."

The first call matches both keys wherever "Northern Ireland" appears, because by default nested matches are resolved only within a key, and here the nested patterns sit in two different keys. (In @phiver's example the patterns are nested within a single key; in mine they are not.) With nested_scope = "dictionary", nested pattern matches are resolved across the entire dictionary rather than just within each key, so the "Ireland" inside "Northern Ireland" is not matched a second time in my example.

Which you choose depends on your purpose. We designed quanteda to have the defaults that most users would want and expect, but added additional options like this for those with specific needs. (And usually those needs are first expressed by Kohei or me working on a specific use case of our own!)
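
To see how this choice carries through to the counts, here is a minimal sketch that forms a dfm under each setting, reusing the Ireland_nested dictionary and txt from above. The names dfm_key and dfm_dict are only illustrative, and it assumes a quanteda version in which tokens_lookup() has the nested_scope argument.

```r
library("quanteda")

Ireland_nested <- dictionary(list(
  ie_alone = "Ireland",
  ie_nested = "Northern Ireland"
))

txt <- c(
  doc1 = "Northern Ireland is a country.",
  doc2 = "Some citizens of Ireland live in Northern Ireland."
)

toks <- tokens(txt)

# default scope: nesting is resolved within each key only
dfm_key <- dfm(tokens_lookup(toks, dictionary = Ireland_nested))

# dictionary scope: nesting is resolved across all keys
dfm_dict <- dfm(tokens_lookup(toks,
  dictionary = Ireland_nested, nested_scope = "dictionary"
))

# compare the per-document counts for each key
as.matrix(dfm_key)
as.matrix(dfm_dict)
```

Going by the token output above, ie_alone for doc2 should drop from 2 to 1 when the scope is the whole dictionary, while ie_nested stays the same.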

Upvotes: 4

phiver

Reputation: 23598

To answer your question in the comment:

How does this work if the dictionary contains a word which then also appears in a multi-word expression in the dictionary

If the text contains "Northern Ireland" and the dictionary contains both "Northern Ireland" and "Ireland", the phrase will be counted only once, but only if both values belong to the same dictionary key, as in the British example in Ken's answer.

See examples below for the differences.

Example combined dictionary:

library("quanteda")

Ireland_combined <- 
  dictionary(list(identity = c("Ireland", "Northern Ireland")))

txt <- c(doc1 = "Northern Ireland is a country.",
         doc2 = "Some citizens of Ireland live in Northern Ireland.")

tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_combined , exclusive = FALSE)

# tokens from 2 documents.
# doc1 :
# [1] "IDENTITY" "is"       "a"        "country"  "."       
#
# doc2 :
# [1] "Some"     "citizens" "of"       "IDENTITY" "live"     "in"       "IDENTITY" "."


tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_combined ) %>%
  dfm()

# Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
# 2 x 1 sparse Matrix of class "dfm"
#       features
# docs   identity
#   doc1        1
#   doc2        2

Example separate dictionary entries:

Ireland_seperated <- 
  dictionary(list(identity1 = c("Ireland"),
                  identity2 = "Northern Ireland"))

tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_seperated , exclusive = FALSE)

# tokens from 2 documents.
# doc1 :
# [1] "IDENTITY2" "IDENTITY1" "is"        "a"         "country"   "."        
# 
# doc2 :
# [1] "Some"      "citizens"  "of"        "IDENTITY1" "live"      "in"        "IDENTITY2" "IDENTITY1" "."

tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_seperated ) %>%
  dfm()

# Document-feature matrix of: 2 documents, 2 features (0.0% sparse).
# 2 x 2 sparse Matrix of class "dfm"
#       features
# docs   identity1 identity2
#   doc1         1         1
#   doc2         2         1

Upvotes: 4
