Richie Cotton

Reputation: 121127

When does locale affect R's regular expressions?

R has several special locale-independent character classes for regular expressions.

From ?regex:

‘[[:alnum:]]’ means ‘[0-9A-Za-z]’, except the latter depends upon the locale and the character encoding, whereas the former is independent of locale and character set.

I'd like to know when locale-specific problems can occur.

I tried two examples based on the ?Comparison help page, which describes how strings are sorted:

in Estonian ‘Z’ comes between ‘S’ and ‘T’

and

in Danish ‘aa’ sorts as a single letter, after ‘z’

In the first example, I would expect T, U, V, W, X and Y not to match. In the second example, I would expect aa not to match.

# Estonian: Z collates between S and T
Sys.setlocale("LC_ALL", "Estonian")
grepl("[A-Z]", LETTERS)

# Danish: "aa" collates as a single letter after z
Sys.setlocale("LC_ALL", "Danish")
grepl("[a-z]", "aa")

Every call returns TRUE, so locale does not appear to affect matching in these two cases.
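One caveat with the experiment above: locale names are platform-specific ("Estonian" is a Windows-style name, while Linux and macOS typically use names such as "et_EE.UTF-8"), and Sys.setlocale() returns "" with a warning when the requested locale is unavailable. If that happens, the test silently runs under the old locale. The sketch below guards against this. The helper name with_first_locale and the candidate locale names are assumptions for illustration, not part of base R.

```r
# Hypothetical helper: run `fun` under the first locale name in `candidates`
# that the platform accepts, then restore the original settings.
with_first_locale <- function(candidates, fun) {
  old_collate <- Sys.getlocale("LC_COLLATE")
  old_ctype <- Sys.getlocale("LC_CTYPE")
  on.exit({
    Sys.setlocale("LC_COLLATE", old_collate)
    Sys.setlocale("LC_CTYPE", old_ctype)
  })
  for (loc in candidates) {
    # Sys.setlocale() returns "" (with a warning) if the locale is unavailable
    got_collate <- suppressWarnings(Sys.setlocale("LC_COLLATE", loc))
    got_ctype <- suppressWarnings(Sys.setlocale("LC_CTYPE", loc))
    if (nzchar(got_collate) && nzchar(got_ctype)) {
      return(list(locale = loc, result = fun()))
    }
  }
  # None of the candidate names was accepted on this machine
  list(locale = NA_character_, result = NULL)
}

# Candidate names are guesses covering Windows and Unix-like systems
res <- with_first_locale(
  c("Estonian", "et_EE.UTF-8", "et_EE"),
  function() grepl("[A-Z]", LETTERS)
)
print(res$locale)  # NA means the test never ran under an Estonian locale
print(res$result)
```

If res$locale is NA, the all-TRUE result above says nothing about Estonian collation, because the locale was never actually switched.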

Can you find an example where locale causes traditional regular expression classes like [a-z] to fail?

UPDATE: I have a partial answer: in my tests, accented Roman characters are matched by [[:alpha:]] but not by [a-zA-Z]. I'm still interested in other examples of differences, in whether locale or encoding affects the matching of non-Roman characters, and in how to match non-Roman characters at all.

Upvotes: 8

Views: 312

Answers (1)

Richie Cotton

Reputation: 121127

It seems that accented Roman characters behave differently under the two kinds of character class.

grepl("[a-zA-Z]", c("å", "é"))
## [1] FALSE FALSE
grepl("[[:alpha:]]", c("å", "é"))
## [1]  TRUE  TRUE

Oddly, non-Roman characters fail to match either character class (at least in the few locales and encodings that I tried).

mu <- "\U03BC"
ya <- "\U044F"
jeem <- "\U062C"
grepl("[a-zA-Z]+", c(mu, ya, jeem))
## [1] FALSE FALSE FALSE
grepl("[[:alpha:]]", c(mu, ya, jeem))
## [1] FALSE FALSE FALSE
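For the open question of how to match non-Roman letters, one possible approach (a sketch, not something tested across the locales above) is to switch to the PCRE engine with perl = TRUE and use the Unicode property \p{L}, which means "any letter". This assumes that R's PCRE library was built with Unicode property support, which pcre_config() reports in recent versions of R, and that the strings are UTF-8 encoded, as strings created with \u escapes are.

```r
mu <- "\u03BC"    # Greek small letter mu
ya <- "\u044F"    # Cyrillic small letter ya
jeem <- "\u062C"  # Arabic letter jeem
x <- c(mu, ya, jeem, "7")

# Check whether this build of PCRE supports \p{...} properties
has_ucp <- isTRUE(pcre_config()[["Unicode properties"]])

# \p{L} is the Unicode "letter" property; it only works with perl = TRUE
is_letter <- if (has_ucp) grepl("\\p{L}", x, perl = TRUE) else NA
print(is_letter)
```

Because \p{L} is defined by Unicode character properties rather than by a collation range, it should not depend on the sort order of the current locale in the way [a-z] can.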

Upvotes: 2
