Dr. Fabian Habersack

Reputation: 1141

Quanteda: How to look up patterns of two or more words in a phrase, when there can be any number of words in between?

I want to match some patterns in a text in R using the package {quanteda} and the tokens_lookup() function with the default valuetype="glob". The pattern would be the occurrence of one word in connection with a second word located anywhere in the same phrase.

library(quanteda)

text <- c(d1 = "apples word word word oranges", 
          d2 = "apples oranges", 
          d3 = "oranges and apples")

dict <- dictionary(list(fruits = c("apple* orange*")))

tokens(text) %>% 
  tokens_lookup(dict, valuetype = "glob") %>% 
  dfm()

Applying this dictionary to the tokenized text from above would yield a result of 0-1-0, while I would expect 1-1-0.

So my question would be, what's with blank spaces in glob pattern matching and shouldn't asterisks match everything including blank spaces? More generally, how can I match d1, d2, and potentially d3 with one and the same pattern?

EDIT:

In regex pattern matching this wouldn't be much of an issue. Example:

text <- c(d1 = "apples word word word oranges", 
          d2 = "apples oranges")

dict <- dictionary(list(fruits = c("apples.*oranges")))

tokens(text, what="sentence") %>%
  tokens_lookup(dict, valuetype = "regex") %>%
  dfm()

Upvotes: 2

Views: 815

Answers (1)

Ken Benoit

Reputation: 14902

tokens() segments on whitespace, and tokens_lookup() finds patterns in tokens, or in sequences of tokens if the dictionary value contains whitespace. Glob matching is applied to each token separately, so an asterisk inside "apple*" never reaches across a space into the next token. To allow any token between two more specific patterns, specify a standalone * at that position in the pattern; it matches exactly one token. (Technically, patterns with whitespace are parsed into sequences that quanteda calls "phrases". See ?phrase.)

So to slightly modify your example:

library("quanteda")
## Package version: 2.0.1

text <- c(
  d1 = "apples word word oranges",
  d2 = "apples and oranges",
  d3 = "oranges and apples"
)

dict <- dictionary(list(fruits = c(
  "apple* * orange*",
  "apple* * * orange*"
)))

tokens(text) %>%
  tokens_lookup(dict, valuetype = "glob", exclusive = FALSE)
## Tokens consisting of 3 documents.
## d1 :
## [1] "FRUITS"
## 
## d2 :
## [1] "FRUITS"
## 
## d3 :
## [1] "oranges" "and"     "apples"

Here, the patterns match apple* followed by exactly one or exactly two tokens of any kind, followed by orange*.

This will not pick up "oranges" followed by "apples", however, since that is the reverse sequence, and it will not pick up just "apples oranges", since there is no token between them. (But you could add that case by adding a third value to your fruits key: "apple* orange*".)
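If the number of intervening tokens varies, writing out every pattern by hand gets tedious. The following is a minimal sketch that generates the glob phrases for gaps of 0 up to some maximum and adds the reversed order as well. The helper gap_patterns() and the cut-off of three intervening tokens are assumptions made for illustration; they are not part of quanteda.

```r
library("quanteda")

# Hypothetical helper (not a quanteda function): builds one glob phrase per
# gap size, from 0 up to max_gap standalone "*" tokens between the two words.
gap_patterns <- function(first, second, max_gap) {
  vapply(
    0:max_gap,
    function(n) paste(c(first, rep("*", n), second), collapse = " "),
    character(1)
  )
}

# The texts from the question
text <- c(
  d1 = "apples word word word oranges",
  d2 = "apples oranges",
  d3 = "oranges and apples"
)

# Both word orders, with up to three tokens in between (assumed limit)
dict <- dictionary(list(fruits = c(
  gap_patterns("apple*", "orange*", max_gap = 3),
  gap_patterns("orange*", "apple*", max_gap = 3)
)))

dfm(tokens_lookup(tokens(text), dict, valuetype = "glob"))
```

With these patterns, d1, d2, and d3 should each be counted under fruits. The maximum gap still has to be chosen in advance, since glob phrases have no way to say "any number of tokens".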

Upvotes: 1
