Reputation: 1141
I have two interrelated questions with respect to pattern matching in R using the package {quanteda} and the tokens_lookup() function with the default valuetype = "glob" (see here and here).
Say I wanted to match a German word which can be spelt slightly differently depending on whether it is singular or plural: "Apfel" (EN: apple), "Äpfel" (EN: apples). For the plural, we thus use the umlaut "ä" instead of "a" at the beginning. So if I look up tokens, I want to make sure that whether or not I find fruits in a text does not depend on whether the word I'm looking for is singular or plural. This is a very simple example and I'm aware that I might as well build a dictionary that features "äpfel*" and "apfel*", but my question is more generally about the use of special characters like square brackets.
So in essence, I thought I could simply go with square brackets, similarly to regex pattern matching: [aä]. More generally, I thought I could use things like [a-z] to match any single letter from a to z, or [0-9] to match any single digit from 0 to 9. In fact, that's what it says here. For some reason, none of that seems to work:
library(quanteda)
text <- c(d1 = "i like apples and apple pie",
          d2 = "ich mag äpfel und apfelkuchen")
dict_1 <- dictionary(list(fruits = c("[aä]pfel*"))) # EITHER "a" OR "ä"
dict_2 <- dictionary(list(fruits = c("[a-z]pfel*"))) # ANY LETTER
tokens(text) %>%
tokens_lookup(dict_1, valuetype = "glob")
tokens(text) %>%
tokens_lookup(dict_2, valuetype = "glob")
1.) Is there a way to use square brackets at all in glob pattern matching?
2.) If so, would [a-z] also match umlauts (ä,ö,ü) or if not, how can we match characters like that?
Upvotes: 0
Views: 201
Reputation: 14902
1) No, you cannot use brackets with glob pattern matching. However, they work perfectly with regex pattern matching.
2) No, [a-z] will not match umlauts.
Here's how to do it, with everything removed from your example that isn't needed to answer the question.
library("quanteda")
## Package version: 2.0.1
text <- "Ich mag Äpfel und Apfelkuchen"
toks <- tokens(text)
dict_1 <- dictionary(list(fruits = c("[aä]pfel*")))
dict_2 <- dictionary(list(fruits = c("[a-z]pfel*")))
tokens_lookup(toks, dict_1, valuetype = "regex", exclusive = FALSE)
## Tokens consisting of 1 document.
## text1 :
## [1] "Ich" "mag" "FRUITS" "und" "FRUITS"
tokens_lookup(toks, dict_2, valuetype = "regex", exclusive = FALSE)
## Tokens consisting of 1 document.
## text1 :
## [1] "Ich" "mag" "Äpfel" "und" "FRUITS"
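One caveat about the regex patterns above, shown as a minimal sketch rather than a definitive recipe: with valuetype = "regex" the trailing * is a quantifier on the preceding "l" (not a wildcard), and the pattern can match anywhere inside a token. Assuming you only want tokens that begin with "apfel" or "äpfel", you can anchor the regex with ^. If you prefer to stay with glob, the fallback is to list each spelling explicitly. The object names dict_glob, dict_regex, res_glob and res_regex are made up for this illustration.

```r
library("quanteda")

toks <- tokens("Ich mag Äpfel und Apfelkuchen")

# Glob: bracket classes are not available, so spell out each variant.
dict_glob <- dictionary(list(fruits = c("apfel*", "äpfel*")))
res_glob <- tokens_lookup(toks, dict_glob, valuetype = "glob", exclusive = FALSE)

# Regex: anchor with ^ so only tokens starting with apfel/äpfel match.
# No trailing * here; in a regex it would mean "zero or more l".
dict_regex <- dictionary(list(fruits = "^[aä]pfel"))
res_regex <- tokens_lookup(toks, dict_regex, valuetype = "regex", exclusive = FALSE)

print(res_glob)
print(res_regex)
```

Both lookups rely on the default case_insensitive = TRUE of tokens_lookup(), which is why the capitalised "Äpfel" and "Apfelkuchen" are matched by lowercase patterns.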
Note: No need to import all of the tidyverse just to get %>%, as quanteda makes this available through re-export.
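On the second question, here is a hedged sketch of one way to match "any letter, including umlauts" with valuetype = "regex": the Unicode property class \p{L} instead of [a-z]. This assumes quanteda's regex matching is handled by the stringi (ICU) engine, which supports Unicode properties. Because dictionary() lowercases its values by default, tolower = FALSE is passed so the uppercase L in \p{L} is kept as written. The names dict_any and res_any are invented for this example.

```r
library("quanteda")

toks <- tokens("Ich mag Äpfel und Apfelkuchen")

# \p{L} matches any single Unicode letter, so both "A" and "Ä" qualify.
# tolower = FALSE stops dictionary() from rewriting \p{L} as \p{l}.
dict_any <- dictionary(list(fruits = "^\\p{L}pfel"), tolower = FALSE)

res_any <- tokens_lookup(toks, dict_any, valuetype = "regex", exclusive = FALSE)
print(res_any)
```

Note that this pattern is deliberately broad: any token that starts with one letter followed by "pfel" would be counted as a fruit, so the narrower [aä] class is the safer choice when only those two spellings are intended.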
Upvotes: 1