When does locale affect R's regular expressions?

Question

R has several special locale-independent character classes for regular expressions.

From ?regex:

‘[[:alnum:]]’ means ‘[0-9A-Za-z]’, except the latter depends upon the locale and the character encoding, whereas the former is independent of locale and character set.

I'd like to know when locale-specific problems can occur.

I tried two examples based on the ?Comparison help page, which describes how strings are sorted:

in Estonian ‘Z’ comes between ‘S’ and ‘T’

and

in Danish ‘aa’ sorts as a single letter, after ‘z’

In the first example, I would expect T, U, V, W, X and Y not to match. In the second example, I would expect aa not to match.

# Estonian: 'Z' sorts between 'S' and 'T'
Sys.setlocale("LC_ALL", "Estonian")
grepl("[A-Z]", LETTERS)

# Danish: 'aa' sorts as a single letter, after 'z'
Sys.setlocale("LC_ALL", "Danish")
grepl("[a-z]", "aa")

Since all values return TRUE, it seems that locale is not a problem here.
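One caveat with this test: Sys.setlocale() returns an empty string (with a warning) when the operating system cannot provide the requested locale, and locale names differ between platforms, so an all-TRUE result could simply mean the locale was never applied. A minimal sketch of a guarded version follows. The helper name `with_locale` is hypothetical (not part of base R), and the locale string "et_EE.UTF-8" is an assumption that will need adjusting for your system.

```r
# Hypothetical helper (not part of base R): run `f` under `locale`, or return
# NULL if the operating system does not provide that locale.
with_locale <- function(locale, f) {
  cats <- c("LC_COLLATE", "LC_CTYPE")
  old <- vapply(cats, Sys.getlocale, character(1))
  # Restore the original settings however the function exits.
  on.exit(for (k in cats) Sys.setlocale(k, old[[k]]), add = TRUE)
  for (k in cats) {
    got <- suppressWarnings(Sys.setlocale(k, locale))
    if (identical(got, "")) return(NULL)  # locale was not applied
  }
  f()
}

# Locale name is an assumption; on Windows it might be "Estonian" instead.
# A NULL result means the locale was unavailable, not that every letter matched.
res <- with_locale("et_EE.UTF-8", function() LETTERS[grepl("[A-Z]", LETTERS)])
print(res)
```

Checking for NULL before interpreting the result separates "the locale had no effect on the range" from "the locale was never set".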

Can you find an example where locale causes traditional regular expression classes like [a-z] to fail?

UPDATE: I have a partial answer: accented roman characters are matched differently by [a-zA-Z] and by [[:alpha:]]. I'm still interested in further examples of differences, in whether locale or encoding affects the matching of non-roman characters, and, indeed, in how to match non-roman characters at all.
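The accented-character difference can be checked with a small comparison like the one below. This is only a sketch: the test words and the function name `compare_classes` are made up for illustration, and the results for the accented words depend on the current locale and encoding, so no expected output is shown.

```r
# Hypothetical test words; "\u00e9" and "\u00f1" are written as escapes so the
# script does not depend on the encoding of the source file.
words <- c("cafe", "caf\u00e9", "ni\u00f1o")

# Compare a traditional range with the POSIX class on whole-word matches.
compare_classes <- function(x) {
  data.frame(
    word  = x,
    range = grepl("^[a-zA-Z]+$", x),
    posix = grepl("^[[:alpha:]]+$", x),
    stringsAsFactors = FALSE
  )
}

# Report the locale alongside the result, since the answer may vary with it.
print(Sys.getlocale("LC_CTYPE"))
print(compare_classes(words))
```

Rows where the `range` and `posix` columns disagree are the cases where the two kinds of character class behave differently in the current session.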
