Reputation: 192
I am trying to extract the first group of non-whitespace characters that follows an Arabic string in each of about 2,100 texts. Some of these texts contain the string, while others do not. This would be an easy task using str_extract from the stringr package if the string were in English. However, for some reason the function doesn't work when the look-behind pattern contains an Arabic string:
library(stringr)
test_texts <- c(
"My text كلمة containing some Arabic",
"My text كلمة again containing some Arabic",
"My text that doesn't contain any Arabic"
)
str_extract(test_texts, "(?<=text )\\S+")
# [1] "كلمة" "كلمة" "that"
str_extract(test_texts, "(?<=containing )\\S+")
# [1] "some" "some" NA
str_extract(test_texts, "(?<=كلمة )\\S+") #returns NAs even though string is there
# [1] NA NA NA
Note that this works if I'm not using a look-behind pattern:
str_extract(test_texts, "كلمة \\S+")
# [1] "كلمة containing" "كلمة again" NA
Why does the Arabic mess things up only when using a look-behind pattern?
I am using R version 3.2.3, on OS X 10.11.3, and stringr version 1.0.0.
Upvotes: 2
Views: 212
Reputation: 627082
There seems to be an issue with how str_extract processes right-to-left text inside a positive lookbehind. As a workaround, you may use str_match with a regex that has a capturing group around the subpattern that captures the value you need:
res <- str_match(test_texts, "كلمة +(\\S+)")
res[,2]
# [1] "containing" "again" NA
This solution also matches the non-whitespace chunk when more than one space follows the Arabic word.
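If you would rather not depend on stringr for this step, the same capturing-group idea can be sketched in base R with regexec and regmatches. This is only a sketch: extract_after is a hypothetical helper name, not a stringr or base R function, and it assumes the word contains no regex metacharacters.

```r
test_texts <- c(
  "My text كلمة containing some Arabic",
  "My text كلمة again containing some Arabic",
  "My text that doesn't contain any Arabic"
)

# Hypothetical helper: returns the first run of non-whitespace characters
# that follows `word` (plus one or more spaces), or NA when `word` is absent.
# Assumption: `word` is used as-is inside the regex, so it must not contain
# regex metacharacters.
extract_after <- function(texts, word) {
  pattern <- paste0(word, " +(\\S+)")
  # regexec() reports the full match and each capturing group;
  # regmatches() yields character(0) for texts without a match.
  matches <- regmatches(texts, regexec(pattern, texts))
  vapply(
    matches,
    function(m) if (length(m) >= 2) m[2] else NA_character_,
    character(1)
  )
}

extract_after(test_texts, "كلمة")
# [1] "containing" "again" NA
```

Because the Arabic word sits outside any look-behind here, this avoids the construct that fails in the question.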
Upvotes: 2
Reputation: 6659
You can match non-ASCII characters like this:
str_extract(test_texts, "[^\001-\177]+")
[1] "كلمة" "كلمة" NA
str_extract(test_texts, "(?<=[^\001-\177] )\\S+")
[1] "containing" "again" NA
Adding brackets to what you had also seems to work. This may not be sufficient either: brackets form a character class, so the look-behind only checks that the character before the space is any one of those letters, not that the whole word is there.
str_extract(test_texts, "(?<=[كلمة] )\\S+")
[1] "containing" "again" NA
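To make that caveat concrete, here is a small base R sketch comparing the bracketed class with the literal word. The example text is made up for illustration and contains a different word that merely ends in one of the same letters.

```r
# Made-up example: "ملك" is a different word that happens to end in "ك",
# one of the letters of "كلمة".
other_text <- "My text ملك again"

# Character class: matches any ONE of the four letters before the space.
class_hit <- grepl("[كلمة] ", other_text)

# Literal word: requires the exact sequence of letters before the space.
literal_hit <- grepl("كلمة ", other_text, fixed = TRUE)

class_hit
# [1] TRUE
literal_hit
# [1] FALSE
```

So the bracketed version would also pick up the chunk after unrelated Arabic words, which may or may not matter for your texts.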
Upvotes: 0