Reputation: 192

Positive look-behind in R that includes non-ascii characters

I am trying to extract the first group of non-whitespace characters that follows an Arabic string for each text in a set of about 2,100 total texts. Some of these texts contain the string, while others do not. This would be a very easy task, using str_extract from the stringr package, if the string were in English. However, for some reason this function doesn't work when using an Arabic string within the look-behind pattern:

library(stringr)
test_texts <- c(
    "My text كلمة containing some Arabic",
    "My text كلمة again containing some Arabic",
    "My text that doesn't contain any Arabic"
)
str_extract(test_texts, "(?<=text )\\S+")
# [1] "كلمة" "كلمة" "that"
str_extract(test_texts, "(?<=containing )\\S+")
# [1] "some" "some" NA    
str_extract(test_texts, "(?<=كلمة )\\S+") #returns NAs even though string is there
# [1] NA NA NA

Note that this works if I'm not using a look-behind pattern:

str_extract(test_texts, "كلمة \\S+")
# [1] "كلمة containing" "كلمة again"      NA

Why does the Arabic mess things up only when using a look-behind pattern?

I am using R version 3.2.3, on OS X 10.11.3, and stringr version 1.0.0.

Upvotes: 2

Answers (2)

Wiktor Stribiżew

Reputation: 627082

There seems to be an issue with how str_extract processes right-to-left text inside a positive look-behind. As a workaround, you can use str_match with a capturing group around the subpattern whose value you need:

res <- str_match(test_texts, "كلمة +(\\S+)")
res[,2]
# [1] "containing" "again"      NA

This solution also matches the non-whitespace chunk when more than one space follows the Arabic word.
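As a minimal sketch of the same workaround, the capture-group pattern can be wrapped in a small helper. The name `extract_after` and the sample texts are hypothetical, not part of stringr, and the Arabic word is written with `\u` escapes only so that the script does not depend on the file encoding:

```r
library(stringr)

# Hypothetical helper: return the first non-whitespace chunk that follows `word`.
# `word` is pasted into the regex as-is, so it must not contain regex metacharacters.
extract_after <- function(texts, word) {
  pattern <- paste0(word, " +(\\S+)")
  # Column 1 of the str_match() result is the full match, column 2 the group.
  str_match(texts, pattern)[, 2]
}

arabic_word <- "\u0643\u0644\u0645\u0629"  # the Arabic word from the question

texts <- c(
  paste("My text", arabic_word, "containing some Arabic"),
  paste0("My text ", arabic_word, "   again with three spaces"),
  "My text that doesn't contain any Arabic"
)

result <- extract_after(texts, arabic_word)
print(result)
```

Texts without the word yield NA, so the result stays aligned with the input vector.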

Upvotes: 2

cory

Reputation: 6659

You can match non-ASCII characters with a negated character class that excludes the ASCII range (octal \001 to \177), like this:

str_extract(test_texts, "[^\001-\177]+")
# [1] "كلمة" "كلمة" NA

str_extract(test_texts, "(?<=[^\001-\177] )\\S+")
# [1] "containing" "again"      NA

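Adding brackets to your original pattern also seems to work. Note, however, that brackets create a character class, so the look-behind matches any single one of those characters followed by a space rather than the whole word in order. That may not be specific enough for your texts.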

str_extract(test_texts, "(?<=[كلمة] )\\S+")
# [1] "containing" "again"      NA
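To make the character-class caveat concrete, here is a small sketch comparing a bracketed pattern with the literal word. It uses str_match with a capturing group rather than a look-behind, so it does not rely on the look-behind behaviour in question. The variable names and the second sample word are assumptions chosen for illustration:

```r
library(stringr)

target <- "\u0643\u0644\u0645\u0629"  # the Arabic word from the question
other  <- "\u0645\u0644\u0643"        # a different word whose last letter occurs in `target`

texts <- c(
  paste("My text", target, "containing some Arabic"),
  paste("My text", other, "unrelated")
)

# Character class: any ONE of the four letters, followed by spaces.
class_pattern <- paste0("[", target, "] +(\\S+)")
# Literal sequence: the four letters in order, followed by spaces.
literal_pattern <- paste0(target, " +(\\S+)")

class_hits   <- str_match(texts, class_pattern)[, 2]
literal_hits <- str_match(texts, literal_pattern)[, 2]

print(class_hits)
print(literal_hits)
```

The bracketed pattern also picks up the word after the unrelated Arabic word, because that word happens to end in one of the bracketed letters, while the literal pattern returns NA for it.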

Upvotes: 0
