Reputation: 121127
R has several special locale-independent character classes for regular expressions.
From ?regex:
‘[[:alnum:]]’ means ‘[0-9A-Za-z]’, except the latter depends upon the locale and the character encoding, whereas the former is independent of locale and character set.
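To make the quoted claim concrete, here is a minimal sketch that compares the two classes on the printable ASCII characters only. The variable names (chars, posix, rng) are made up for illustration, and the comparison says nothing about non-ASCII input, which is where any locale or encoding dependence would be expected to show up.

```r
# Printable ASCII characters (code points 32 to 126), one per element
chars <- strsplit(rawToChar(as.raw(32:126)), "")[[1]]

# POSIX class versus the explicit range it is said to stand for
posix <- grepl("[[:alnum:]]", chars)
rng   <- grepl("[0-9A-Za-z]", chars)

# Characters on which the two classes disagree, if any
chars[posix != rng]
```

If the last line returns anything in a particular locale, those characters would be a direct example of the difference the help page describes.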
I'd like to know when locale-specific problems can occur.
I tried two examples based on the ?Comparison help page, which describes how strings are sorted:
in Estonian ‘Z’ comes between ‘S’ and ‘T’
and
in Danish ‘aa’ sorts as a single letter, after ‘z’
In the first example, I would expect T, U, V, W, X and Y not to match. In the second example, I would expect aa not to match.
Sys.setlocale("LC_ALL", "Estonian")
grepl("[A-Z]", LETTERS)
Sys.setlocale("LC_ALL", "Danish")
grepl("[a-z]", "aa")
Since every call returns TRUE for all values, it seems that locale is not a problem here.
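One caveat with this experiment is that locale names are platform-specific ("Estonian" and "Danish" are Windows-style names), and Sys.setlocale() returns an empty string, with a warning, when the request cannot be honoured. The sketch below uses a hypothetical helper, with_collate(), that switches only LC_COLLATE (an assumption: that is the category behind the sort orders quoted above), reports when the switch did not happen, and restores the previous setting afterwards.

```r
# Hypothetical helper: evaluate `expr` under a temporary LC_COLLATE setting.
with_collate <- function(locale, expr) {
  old <- Sys.getlocale("LC_COLLATE")
  # Restore the original collation order when the function exits
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)

  # Sys.setlocale() returns "" if the locale name is not recognised
  new <- suppressWarnings(Sys.setlocale("LC_COLLATE", locale))
  if (identical(new, "")) {
    message("Locale '", locale, "' is not available; still using '", old, "'")
  }

  # `expr` is a promise, so it is only evaluated here, after the switch
  force(expr)
}

res <- with_collate("Estonian", grepl("[A-Z]", LETTERS))
```

If the message appears, the test ran under the original locale and tells you nothing about Estonian collation; on Linux or macOS the name would more likely be something like "et_EE.UTF-8", assuming that locale is installed.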
Can you find an example where locale causes traditional regular expression classes like [a-z] to fail?
UPDATE: I have a partial answer: accented Roman characters behave differently using [a-zA-Z] vs. [[:alpha:]]. I'm still interested to know whether there are more examples of differences, whether locale or encoding affects matching of non-Roman characters, and indeed how to match non-Roman characters at all.
Upvotes: 8
Views: 312
Reputation: 121127
There does seem to be a difference in behaviour for accented Roman characters: the explicit range does not match them, but the POSIX class does.
grepl("[a-zA-Z]", c("å", "é"))
## [1] FALSE FALSE
grepl("[[:alpha:]]", c("å", "é"))
## [1] TRUE TRUE
Oddly, non-Roman characters (Greek mu, Cyrillic ya and Arabic jeem below) fail to match with either character class, at least in the few locales and encodings that I tried.
mu <- "\U03BC"
ya <- "\U044F"
jeem <- "\U062C"
grepl("[a-zA-Z]+", c(mu, ya, jeem))
## [1] FALSE FALSE FALSE
grepl("[[:alpha:]]", c(mu, ya, jeem))
## [1] FALSE FALSE FALSE
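One possible way to match non-Roman letters is to switch to the PCRE engine with perl = TRUE and use Unicode property escapes such as \p{L} (any letter) or a script name such as \p{Greek}. This is a sketch rather than a tested recipe: ?regex only says that such properties may be supported, so it assumes that R's PCRE build has Unicode property support and that the strings are marked as UTF-8 (which "\u" escapes are). The vector x and the result names are made up for illustration.

```r
mu   <- "\u03BC"  # Greek small letter mu
ya   <- "\u044F"  # Cyrillic small letter ya
jeem <- "\u062C"  # Arabic letter jeem

# Mixed input: ASCII letter, accented letter, three non-Roman letters, a digit
x <- c("a", "\u00E5", mu, ya, jeem, "1")

# \p{L} is the Unicode "letter" property; requires perl = TRUE
is_letter <- grepl("\\p{L}", x, perl = TRUE)

# A script property narrows the match to one writing system
is_greek <- grepl("\\p{Greek}", x, perl = TRUE)
```

Because these classes are defined in terms of Unicode properties rather than a range of code points, they should not depend on collation order the way a range like [a-z] might.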
Upvotes: 2