Reputation: 17715
\\b represents a word boundary. I don't understand why this operator has different effects depending on the character that follows. Example:
test1 <- 'aland islands'
test2 <- 'åland islands'
regex1 <- "[å|a]land islands"
regex2 <- "\\b[å|a]land islands"
grepl(regex1, test1, perl = TRUE)
[1] TRUE
grepl(regex2, test1, perl = TRUE)
[1] TRUE
grepl(regex1, test2, perl = TRUE)
[1] TRUE
grepl(regex2, test2, perl = TRUE)
[1] FALSE
This only seems to be an issue when perl = TRUE:
grepl(regex1, test2, perl = FALSE)
[1] TRUE
grepl(regex2, test2, perl = FALSE)
[1] TRUE
Unfortunately, in my application, I absolutely need to keep perl = TRUE.
Upvotes: 3
Views: 921
Reputation: 18950
This is a known glitch in R's regex subsystem. It is related to the character encoding of the input and to the system locale and build properties.
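Before changing any pattern, it can help to check which encoding and locale R is actually working with. The following is a minimal sketch using only base R; the variable names x and x_utf8 are just for illustration, and the locale output will differ between machines.

```r
# Build the test string with an explicit Unicode escape so that it does not
# depend on the encoding of the script file itself.
x <- "\u00e5land islands"

# Declared encoding of the string: "UTF-8", "latin1", "bytes" or "unknown".
print(Encoding(x))

# Character type locale used by the current session.
print(Sys.getlocale("LC_CTYPE"))

# Whether the session runs in a multibyte / UTF-8 / Latin-1 locale.
str(l10n_info())

# enc2utf8() converts to UTF-8 and marks the result accordingly, which makes
# the input encoding explicit before it reaches grepl().
x_utf8 <- enc2utf8(x)
print(Encoding(x_utf8))
```

If the string is not marked as UTF-8 and the session is not in a UTF-8 locale, that is a likely reason for seeing different results than on another machine.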
The R documentation on grep states (highlighting added):
The POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).
Only gsub and gregexpr are mentioned here, but grepl seems to be affected as well.
There are two options. Either use perl = FALSE, as you already discovered, or stick with the PCRE regex and add the (*UCP) flag (Unicode Character Properties), which changes the matching behavior so that Unicode alphanumerics count as word characters when \\b is evaluated:
Code Sample:
options(encoding = "UTF-8")
test1 <- 'aland islands'
test2 <- 'åland islands'
regex1 <- "[å|a]land islands"
regex2 <- "(*UCP)\\b[å|a]land islands"
grepl(regex1, test1, perl = TRUE)
#[1] TRUE
grepl(regex2, test1, perl = TRUE)
#[1] TRUE
grepl(regex1, test2, perl = TRUE)
#[1] TRUE
grepl(regex2, test2, perl = TRUE)
#[1] TRUE
grepl(regex1, test2, perl = FALSE)
#[1] TRUE
grepl(regex2, test2, perl = FALSE)
#[1] FALSE
Notes:
The 6th test, grepl(regex2, test2, perl = FALSE), fails: it uses TRE together with the PCRE-specific (*UCP) flag.
The (*UCP) flag does not work if R's PCRE was built without Unicode support (this may be the case in some environments, e.g. some minimal cloud/Docker installations).
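Whether the PCRE library linked into R supports Unicode properties can be checked from within R. A small sketch, assuming an R version where base R's pcre_config() and extSoftVersion() are available; the helper name has_ucp_support is made up for this example:

```r
# pcre_config() reports how the PCRE library used by R was built.
cfg <- pcre_config()
print(cfg)

# Hypothetical helper: TRUE if PCRE was built with Unicode property support,
# which the (*UCP) flag relies on.
has_ucp_support <- function() {
  isTRUE(unname(pcre_config()["Unicode properties"]))
}

if (!has_ucp_support()) {
  warning("PCRE lacks Unicode property support; (*UCP) will likely not work here")
}

# Version of the PCRE library in use, useful when comparing platforms.
print(extSoftVersion()["PCRE"])
```

Running this once at the start of a script makes a missing Unicode build visible early, instead of showing up later as a silently failing match.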
What's really annoying is that R's behavior is inconsistent across platforms:
Test your original code with these online R environments:
Only test case 4, grepl(regex2, test2, perl = TRUE), is FALSE (running R 3.3/3.4 on Linux?)
Test cases 4 and 6 are FALSE (running R 3.3-3.5 on Linux?)
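To compare platforms yourself, the six test cases from the question can be collected into one small table. This is only a sketch: the data frame layout and column names are my own choice, and the non-ASCII character is written as \u00e5 so the script does not depend on the file encoding.

```r
test1  <- "aland islands"
test2  <- "\u00e5land islands"
regex1 <- "[\u00e5|a]land islands"
regex2 <- "\\b[\u00e5|a]land islands"

# One row per test case, in the same order as in the question.
cases <- data.frame(
  pattern = c(regex1, regex2, regex1, regex2, regex1, regex2),
  input   = c(test1,  test1,  test2,  test2,  test2,  test2),
  perl    = c(TRUE,   TRUE,   TRUE,   TRUE,   FALSE,  FALSE),
  stringsAsFactors = FALSE
)

# Run grepl() for every row with the matching perl setting.
cases$match <- mapply(
  function(p, x, use_perl) grepl(p, x, perl = use_perl),
  cases$pattern, cases$input, cases$perl,
  USE.NAMES = FALSE
)

print(cases)

# Record the environment next to the results.
print(R.version.string)
print(Sys.getlocale("LC_CTYPE"))
```

Posting the printed table together with the R version and locale makes it easier to see which of the six cases differ between two systems.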
Upvotes: 6