Reputation: 115
I am using tokens_lookup() to see whether some texts contain the words in my dictionary. Now I am trying to find a way to discard the matches that occur when the dictionary word appears inside a longer fixed sequence of words. For example, suppose that Ireland is in the dictionary. I would like to exclude the cases where, for instance, Northern Ireland is mentioned (or any fixed sequence of words that contains Ireland). The only indirect solution I have figured out is to build another dictionary with these multi-word sequences (e.g. Northern Ireland). However, this solution would not work when both Ireland and Northern Ireland are cited in the same text. Thank you.
library("quanteda")
dict <- dictionary(list(IE = "Ireland"))
txt <- c(
doc1 = "Ireland lorem ipsum",
doc2 = "Lorem ipsum Northern Ireland",
doc3 = "Ireland lorem ipsum Northern Ireland"
)
toks <- tokens(txt)
tokens_lookup(toks, dictionary = dict)
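For reference, this is the output I get (quanteda 3.x), showing the unwanted matches in doc2 and doc3:

```r
library("quanteda")

dict <- dictionary(list(IE = "Ireland"))
toks <- tokens(c(
  doc1 = "Ireland lorem ipsum",
  doc2 = "Lorem ipsum Northern Ireland",
  doc3 = "Ireland lorem ipsum Northern Ireland"
))

# The plain lookup also matches "Ireland" inside "Northern Ireland"
tokens_lookup(toks, dictionary = dict)
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE"
##
## doc2 :
## [1] "IE"
##
## doc3 :
## [1] "IE" "IE"
```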
Upvotes: 1
Views: 90
Reputation: 14902
You can do this by specifying another dictionary key for "Northern Ireland", with the value also "Northern Ireland". If you use the argument nested_scope = "dictionary" in tokens_lookup(), then this will match the longer phrase first, and only once, separating "Ireland" from "Northern Ireland". By using the same key as the value, you replace it like for like (with the side benefit of having the two tokens "Northern" and "Ireland" combined into a single token).
library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dict <- dictionary(list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"))
txt <- c(
doc1 = "Ireland lorem ipsum",
doc2 = "Lorem ipsum Northern Ireland",
doc3 = "Ireland lorem ipsum Northern Ireland"
)
toks <- tokens(txt)
tokens_lookup(toks,
dictionary = dict, exclusive = FALSE,
nested_scope = "dictionary", capkeys = FALSE
)
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE" "lorem" "ipsum"
##
## doc2 :
## [1] "Lorem" "ipsum" "Northern Ireland"
##
## doc3 :
## [1] "IE" "lorem" "ipsum" "Northern Ireland"
Here I used exclusive = FALSE for illustration purposes, so you can see what got looked up and replaced. You can remove that and the capkeys argument when you run it.
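With those two arguments removed, the defaults (exclusive = TRUE, capkeys = TRUE) keep only the dictionary keys and uppercase them; a minimal sketch with the same toy data:

```r
library("quanteda")

dict <- dictionary(list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"))
toks <- tokens(c(
  doc1 = "Ireland lorem ipsum",
  doc2 = "Lorem ipsum Northern Ireland",
  doc3 = "Ireland lorem ipsum Northern Ireland"
))

# Defaults: only dictionary keys survive, uppercased by capkeys = TRUE
tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary")
## doc1: "IE"
## doc2: "NORTHERN IRELAND"
## doc3: "IE" "NORTHERN IRELAND"
```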
If you want to discard the "Northern Ireland" tokens, just use
tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary") %>%
tokens_remove("Northern Ireland")
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE"
##
## doc2 :
## character(0)
##
## doc3 :
## [1] "IE"
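If the end goal is counting standalone "Ireland" mentions rather than inspecting tokens, the same lookup converts directly to a document-feature matrix; a sketch (dfm() lowercases the keys by default, so the feature name becomes "ie"):

```r
library("quanteda")

dict <- dictionary(list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"))
toks <- tokens(c(
  doc1 = "Ireland lorem ipsum",
  doc2 = "Lorem ipsum Northern Ireland",
  doc3 = "Ireland lorem ipsum Northern Ireland"
))

# Count only the standalone "Ireland" matches per document,
# dropping the "northern ireland" feature
dfm(tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary")) %>%
  dfm_select(pattern = "ie")
```

This gives a count of 1 for doc1 and doc3 and 0 for doc2, since the "Ireland" in doc2 was absorbed into the "Northern Ireland" key.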
Upvotes: 3