Rafael

Reputation: 3196

Regex with Chinese characters

I'm searching text_ which is: 本周(3月25日-3月31日),国内油厂开机率继续下降,全国各地油厂大豆压榨总量1456000吨(出粕1157520吨,出油262080吨),较上周的...[continued]

  crush <- str_extract(string = text_, pattern = perl("(?<=量).*(?=吨(出粕)"))
  meal <- str_extract(string = text_, pattern = perl("(?<=粕).*(?=吨,出)"))
  oil <-  str_extract(string = text_, pattern = perl("(?<=出油).*(?=吨))"))

prints

[1] "1456000"   ## correct
[1] "1157520"   ## correct
[1] NA          ## looking for 262080 here

Why do the first two match but not the last one? I'm using the stringr library.

Upvotes: 5

Views: 1676

Answers (4)

Rain.Wei

Reputation: 131

Did you check your ICU version? I ran into this problem before. At that time stringi was built against ICU 55; I recompiled stringi with ICU 58, and after that stringr worked fine with Chinese characters. Newer stringi releases are compiled against an ICU version newer than 60, so the problem should be fixed.
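A quick way to see which ICU version your installed stringi is linked against is sketched below. It assumes stringi is installed and only reads the `ICU.version` field returned by `stri_info()`; the variable name `icu_version` is just for illustration.

```r
library(stringi)

# stri_info() reports details of the ICU build that stringi is linked against.
info <- stri_info()

# The ICU version is returned as a character string.
icu_version <- info$ICU.version
print(icu_version)
```

If the reported version is old, reinstalling or updating stringi is the first thing to try.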

Upvotes: 0

Rafael

Reputation: 3196

For some reason I still don't understand, I wasn't able to use the solution @WiktorStribiżew suggested in the comments, but this ended up working:

oil <-  str_extract(string = text_, pattern = perl("(?<=吨).*(?=吨)"))
# [1] "(出粕1157520吨,出油262080吨),较
oil <- str_extract(string = oil, pattern = perl("(?<=油)\\d+(?=吨)"))
# [1] "262080"

Upvotes: 1

Wiktor Stribiżew

Reputation: 627103

Note that the current version of the stringr package is based on the ICU regex library, and using perl() is deprecated.

Lookbehind patterns are fixed-width, and there seems to be a problem with how ICU parses the first character in your lookbehind pattern (for some unknown reason, it cannot calculate its width).

Since you are using stringr, you can simply rely on a capturing group with str_match to extract part of the match:

> match <- str_match(s, "出油(\\d+)吨")
> match[,2]
[1] "262080"

This way, you will avoid potential issues in the future. These regexps also run faster, since the pattern contains no unanchored lookbehind that has to be tried at every position in the searched string.
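The same capture-group idea can pull all three figures out in one pass. This is only a sketch: the sample string is the visible part of the text from the question (which is truncated there, so it is truncated here too), and the labels 总量, 出粕 and 出油 are assumed to precede the numbers exactly as shown. Using `.*?` between the groups means the pattern does not depend on which kind of parenthesis or comma separates them.

```r
library(stringr)

# Visible portion of the question's text; the original continues past this point.
text_ <- "本周(3月25日-3月31日),国内油厂开机率继续下降,全国各地油厂大豆压榨总量1456000吨(出粕1157520吨,出油262080吨),较上周的"

# One capture group per figure; no lookbehind is involved.
m <- str_match(text_, "总量(\\d+)吨.*?出粕(\\d+)吨.*?出油(\\d+)吨")

# Column 1 is the full match; columns 2 to 4 are the capture groups.
crush <- as.numeric(m[, 2])
meal  <- as.numeric(m[, 3])
oil   <- as.numeric(m[, 4])
```

If any of the three labels is missing from the text, every column of `m` is NA, which makes a failed match easy to detect.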

Also, you may just use your PCRE regex with base R:

> regmatches(s, regexpr("(?<=出油)\\d+(?=吨)", s, perl=TRUE))
[1] "262080"

Upvotes: 2

LONG

Reputation: 4620

Try this:

  oil <-  str_extract(string = text_, pattern = perl("(?<=出油).*(?=吨),较上周的))"))

Because the simple pattern can match again later in your text, it cannot pinpoint which part you want; the match may run past the intended number or cause a data type issue.
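To illustrate the point about a later 吨, here is a small sketch with a made-up string (`x` is hypothetical, not taken from the question's data). A greedy `.*` runs to the last 吨, while `\\d+` stops at the end of the number.

```r
library(stringr)

# Hypothetical sample in which 吨 appears again after the figure of interest.
x <- "出油262080吨,较上周增加1000吨"

# Greedy .* extends to the last 吨 in the string.
greedy <- str_match(x, "出油(.*)吨")[, 2]

# \\d+ only consumes digits, so it stops at the first 吨.
digits <- str_match(x, "出油(\\d+)吨")[, 2]

print(greedy)  # "262080吨,较上周增加1000"
print(digits)  # "262080"
```

Restricting the capture to digits also means the result can be passed straight to as.numeric().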

Upvotes: 0
