Vincent

Reputation: 17715

R regular expression: using \\b with 'Å' vs. 'A' characters

\\b represents a word boundary. I don't understand why this operator has different effects depending on the character that follows. Example:

test1 <- 'aland islands'
test2 <- 'åland islands'

regex1 <- "[å|a]land islands"
regex2 <- "\\b[å|a]land islands"

grepl(regex1, test1, perl = TRUE)
[1] TRUE
grepl(regex2, test1, perl = TRUE)
[1] TRUE

grepl(regex1, test2, perl = TRUE)
[1] TRUE
grepl(regex2, test2, perl = TRUE)
[1] FALSE

This only seems to be an issue when perl = TRUE:

grepl(regex1, test2, perl = FALSE)
[1] TRUE
grepl(regex2, test2, perl = FALSE)
[1] TRUE

Unfortunately, in my application, I absolutely need to keep perl=TRUE.

Upvotes: 3

Views: 921

Answers (1)

wp78de

Reputation: 18950

This is a (known) glitch in R's regex subsystem. It is related to the character encoding of the input and to the system locale and build properties.
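To see which of these factors are in play on a given machine, a minimal diagnostic sketch follows. It only uses base R functions; the variable names (`loc`, `info`, `x`, `x_utf8`) are illustrative and not part of the original question. The `\u00e5` escape stands in for `å` so the snippet does not depend on the encoding of the script file.

```r
# Locale used for character classification (what counts as a "word" character).
loc <- Sys.getlocale("LC_CTYPE")

# Locale details; the `UTF-8` element says whether the session runs in a UTF-8 locale.
info <- l10n_info()

# Test string written with a Unicode escape instead of a literal non-ASCII character.
x <- "\u00e5land islands"

# Declared encoding of the string, and an explicit UTF-8 copy of it.
enc <- Encoding(x)
x_utf8 <- enc2utf8(x)

cat("LC_CTYPE:", loc, "\n")
cat("UTF-8 locale:", info[["UTF-8"]], "\n")
cat("Declared encoding:", enc, "\n")

# Compare both regex engines on the same input; results may differ by platform.
for (p in c(TRUE, FALSE)) {
  cat("perl =", p, "->", grepl("\\b\u00e5land", x_utf8, perl = p), "\n")
}
```

Running this on the machine that shows the problem and on one that does not should make it easier to tell whether the locale or the string encoding differs.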

The R documentation on grep states (highlighting added):

The POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).

Only gsub and gregexpr are mentioned here, but grepl seems to be affected as well.

Possible solutions

  • Use R's default regex engine (TRE) with perl = FALSE, as you have already discovered.
  • Stick with the PCRE regex engine and add the (*UCP) flag (Unicode Character Properties). It makes \\b, \\w and \\d use Unicode properties, so that non-ASCII letters such as å count as word characters instead of being treated as non-word characters:

    Code Sample:

    options(encoding = "UTF-8")
    
    test1 <- 'aland islands'
    test2 <- 'åland islands'
    regex1 <- "[å|a]land islands"
    regex2 <- "(*UCP)\\b[å|a]land islands"    
    grepl(regex1, test1, perl = TRUE)
    #[1] TRUE
    grepl(regex2, test1, perl = TRUE)
    #[1] TRUE
    grepl(regex1, test2, perl = TRUE)
    #[1] TRUE
    grepl(regex2, test2, perl = TRUE)
    #[1] TRUE
    grepl(regex1, test2, perl = FALSE)
    #[1] TRUE
    grepl(regex2, test2, perl = FALSE)
    #[1] FALSE
    

    Online Demo

    Notes:

    • The 6th test, grepl(regex2, test2, perl = FALSE), returns FALSE because (*UCP) is PCRE syntax and is not treated as a flag by TRE.

    • The *UCP flag does not work if R is not installed with Unicode support for PCRE (may be the case in some environments, e.g. some minimal Cloud/Docker installations).
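    Whether the PCRE library behind R was built with Unicode property support can be checked with the base function `pcre_config()`. The sketch below does that and then prepends `(*UCP)` only when support is reported. The helper `with_ucp` is hypothetical (it is not part of R or of the sample above), and the character class is written as `[\u00e5a]` because a `|` inside brackets is a literal pipe rather than an alternation.

```r
# Named logical vector describing the PCRE build R is linked against.
cfg <- pcre_config()
print(cfg)

# TRUE when PCRE reports Unicode property support, which (*UCP) relies on.
has_ucp <- isTRUE(unname(cfg["Unicode properties"]))

# Hypothetical helper: add the (*UCP) prefix only when it is supported.
with_ucp <- function(pattern) {
  if (has_ucp) paste0("(*UCP)", pattern) else pattern
}

# \u00e5 is the Unicode escape for the letter a-with-ring used in the question.
regex2 <- with_ucp("\\b[\u00e5a]land islands")
result <- grepl(regex2, "\u00e5land islands", perl = TRUE)
print(result)
```

    The prefixed pattern is only meaningful with perl = TRUE, so the helper should not be used for patterns that are passed to the TRE engine.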


What's really annoying is that R's behavior is inconsistent across platforms:

  • Works as expected on current 64-bit Windows (10)
  • May work on current Linux distros

Test your original code with these online R environments:

  • tutorialspoint or
  • Ideone

    Only test case 4 is FALSE: grepl(regex2, test2, perl = TRUE)
    (Running R 3.3/3.4 on Linux?)

  • JDoodle

    Test cases 4 and 6 are FALSE (Running R 3.3-3.5 on Linux?)
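To compare environments quickly, the six test cases from the question can be run in one go and printed as a table. This is a rough sketch: the data frame layout and the names `cases` and `match` are my own, the case numbering follows the order used in the question, and `\u00e5` replaces the literal `å` so the script does not depend on the file encoding. The stray `|` from the original character class is dropped, since it only matches a literal pipe there.

```r
test1 <- "aland islands"
test2 <- "\u00e5land islands"

regex1 <- "[\u00e5a]land islands"
regex2 <- "\\b[\u00e5a]land islands"

# One row per test case, in the same order as in the question.
cases <- data.frame(
  case    = 1:6,
  pattern = c(regex1, regex2, regex1, regex2, regex1, regex2),
  text    = c(test1, test1, test2, test2, test2, test2),
  perl    = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Run grepl once per row; `perl` varies row by row.
cases$match <- mapply(grepl, cases$pattern, cases$text,
                      perl = cases$perl, USE.NAMES = FALSE)

print(cases[, c("case", "perl", "match")])
cat("R version:", R.version.string, "\n")
cat("LC_CTYPE:", Sys.getlocale("LC_CTYPE"), "\n")
```

Rows where `match` is FALSE point to the engine and platform combinations that show the problem.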



Upvotes: 6
