Quanteda: How can I use square brackets with glob-style pattern matching using tokens_lookup?

Question

I have two interrelated questions with respect to pattern matching in R using the package {quanteda} and the tokens_lookup() function with the default valuetype="glob" (see here and here).

Say I want to match a German word that is spelled slightly differently in the singular and the plural: "Apfel" (EN: apple) and "Äpfel" (EN: apples). The plural starts with the umlaut "ä" instead of "a". When I look up tokens, I want the match to be independent of whether the word I'm looking for is singular or plural. This is a very simple example, and I'm aware that I could just as well build a dictionary that features both "äpfel*" and "apfel*", but my question is more generally about the use of special characters like square brackets.

So in essence, I thought I could simply use square brackets, as in regex pattern matching: [aä]. More generally, I thought I could use things like [a-z] to match any single letter from a to z, or [0-9] to match any single digit from 0 to 9. In fact, that's what it says here. For some reason, none of that seems to work:

library(quanteda)

text <- c(d1 = "i like apples and apple pie", 
          d2 = "ich mag äpfel und apfelkuchen")

dict_1 <- dictionary(list(fruits = c("[aä]pfel*")))      # EITHER "a" OR "ä"
dict_2 <- dictionary(list(fruits = c("[a-z]pfel*")))     # ANY LETTER

tokens(text) %>%
  tokens_lookup(dict_1, valuetype = "glob")

tokens(text) %>%
  tokens_lookup(dict_2, valuetype = "glob")

1.) Is there a way to use square brackets at all in glob pattern matching?

2.) If so, would [a-z] also match umlauts (ä, ö, ü)? If not, how can we match characters like that?

Ken Benoit · Accepted Answer

1) No, you cannot use brackets with glob pattern matching. However, they work perfectly with regex pattern matching.

2) No, [a-z] will not match umlauts.
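
If you prefer to stay with valuetype = "glob", one workaround (the one the question itself mentions) is to list each spelling as a separate pattern, since glob offers only the * and ? wildcards. The following is a minimal sketch; the name dict_glob is made up for illustration.

```r
library("quanteda")

toks <- tokens("Ich mag Äpfel und Apfelkuchen")

# Glob has no character classes, so each spelling gets its own pattern.
# Matching is case-insensitive by default, so "äpfel*" also covers "Äpfel".
dict_glob <- dictionary(list(fruits = c("apfel*", "äpfel*")))

tokens_lookup(toks, dict_glob, valuetype = "glob", exclusive = FALSE)
```

A looser alternative would be the glob pattern "?pfel*", where ? stands for any single character, but that would accept any first character and not only "a" or "ä".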

Here is how to do it with valuetype = "regex", with everything stripped from your example that is not needed to answer the question.

library("quanteda")
## Package version: 2.0.1

text <- "Ich mag Äpfel und Apfelkuchen"

toks <- tokens(text)

dict_1 <- dictionary(list(fruits = c("[aä]pfel*")))
dict_2 <- dictionary(list(fruits = c("[a-z]pfel*")))

tokens_lookup(toks, dict_1, valuetype = "regex", exclusive = FALSE)
## Tokens consisting of 1 document.
## text1 :
## [1] "Ich"    "mag"    "FRUITS" "und"    "FRUITS"
tokens_lookup(toks, dict_2, valuetype = "regex", exclusive = FALSE)
## Tokens consisting of 1 document.
## text1 :
## [1] "Ich"    "mag"    "Äpfel"  "und"    "FRUITS"
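
One caveat when reusing a glob-style pattern as a regex: under valuetype = "regex" the pattern is not anchored, and * means "zero or more of the preceding character" rather than "anything". So "[aä]pfel*" matches any token that contains "apfe" or "äpfe" anywhere. The sketch below contrasts it with an anchored pattern; the example sentence and the names dict_loose and dict_strict are assumptions added for illustration.

```r
library("quanteda")

toks <- tokens("Ich mag Äpfel, Apfelkuchen und Tannenzapfen")

# Unanchored: hits every token containing "apfe" or "äpfe",
# which includes "Tannenzapfen" (pine cones, not a fruit)
dict_loose <- dictionary(list(fruits = "[aä]pfel*"))

# Anchored with ^: the token has to start with "apfel" or "äpfel"
dict_strict <- dictionary(list(fruits = "^[aä]pfel"))

tokens_lookup(toks, dict_loose, valuetype = "regex", exclusive = FALSE)
tokens_lookup(toks, dict_strict, valuetype = "regex", exclusive = FALSE)
```

With the anchored pattern, "Tannenzapfen" should be left untouched while "Äpfel" and "Apfelkuchen" are still replaced by the dictionary key.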

Note: No need to import all of the tidyverse just to get %>%, as quanteda makes this available through re-export.
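
To match umlauts where [a-z] does not, two options are sketched below: add the extra letters to the bracket expression, or use the Unicode property class \p{L} ("any letter"), which the regex engine behind quanteda (stringi/ICU) supports. The names dict_explicit and dict_unicode are illustrative only, and tolower = FALSE is set as a precaution so that dictionary() does not lowercase the escape sequence.

```r
library("quanteda")

toks <- tokens("Ich mag Äpfel und Apfelkuchen")

# Option 1: extend the class with the German umlauts explicitly
dict_explicit <- dictionary(list(fruits = "^[a-zäöü]pfel"))

# Option 2: \p{L} matches any Unicode letter, umlauts included.
# tolower = FALSE keeps the pattern exactly as written.
dict_unicode <- dictionary(list(fruits = "^\\p{L}pfel"), tolower = FALSE)

tokens_lookup(toks, dict_explicit, valuetype = "regex", exclusive = FALSE)
tokens_lookup(toks, dict_unicode, valuetype = "regex", exclusive = FALSE)
```

Both lookups rely on the default case_insensitive = TRUE, so the capitalised "Äpfel" and "Apfelkuchen" should be matched as well.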
