Reputation: 3196
I'm searching text_, which is:
本周(3月25日-3月31日),国内油厂开机率继续下降,全国各地油厂大豆压榨总量1456000吨(出粕1157520吨,出油262080吨),较上周的...[continued]
crush <- str_extract(string = text_, pattern = perl("(?<=量).*(?=吨(出粕)"))
meal <- str_extract(string = text_, pattern = perl("(?<=粕).*(?=吨,出)"))
oil <- str_extract(string = text_, pattern = perl("(?<=出油).*(?=吨))"))
This prints:
[1] "1456000" ## correct
[1] "1157520" ## correct
[1] NA ## looking for 262080 here
Why do the first two match but not the last one? I'm using the stringr library.
Upvotes: 5
Views: 1676
Reputation: 131
Did you check your ICU version? I ran into this problem before. At that time stringi was built against ICU 55; after I recompiled stringi with ICU 58, stringr worked fine with Chinese characters. Newer versions of stringi are compiled with an ICU version newer than 60, so the problem should be fixed.
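A quick way to see which ICU version your installed stringi uses, as a minimal sketch. It assumes stringi is installed; the variable name icu_version is only for illustration.

```r
# stri_info() returns a list describing the stringi build,
# including the ICU version it was compiled against.
icu_version <- stringi::stri_info()$ICU.version

# Print it so it can be compared with the versions mentioned above (55, 58, 60).
print(icu_version)
```

If the reported version is old, updating or rebuilding stringi is the first thing I would try.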
Upvotes: 0
Reputation: 3196
For a reason I still don't understand, I wasn't able to use the solution from @WiktorStribiżew's comment, but this ended up working:
oil <- str_extract(string = text_, pattern = perl("(?<=吨).*(?=吨)"))
# [1] "(出粕1157520吨,出油262080吨),较
oil <- str_extract(string = oil, pattern = perl("(?<=油)\\d+(?=吨)"))
# [1] "262080"
Upvotes: 1
Reputation: 627103
Note that the current version of the stringr package is based on the ICU regex library, and using perl() is deprecated.
Lookbehind patterns must be fixed-width, and ICU seems to have a problem parsing the first character of your lookbehind pattern (for some unknown reason it cannot calculate its width).
Since you are using stringr
, you may just rely on capturing that can be achieved with str_match
, to extract a part of the pattern:
> match <- str_match(s, "出油(\\d+)吨")
> match[,2]
[1] "262080"
This way, you avoid potential lookbehind issues in the future. These regexps also run faster, because the pattern contains no unanchored lookbehind that has to be tried at every position in the searched string.
Also, you may just use your PCRE regex with base R:
> regmatches(s, regexpr("(?<=出油)\\d+(?=吨)", s, perl=TRUE))
[1] "262080"
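The same capturing approach can cover all three numbers from the question. This is only a sketch: s is a shortened sample of the question's text (the full string is truncated in the post), and extract_tons is a hypothetical helper, not part of stringr.

```r
library(stringr)

# Shortened sample of the text from the question.
s <- "全国各地油厂大豆压榨总量1456000吨(出粕1157520吨,出油262080吨),较上周的"

# Hypothetical helper: capture the digits between `label` and the next 吨.
extract_tons <- function(text, label) {
  m <- str_match(text, paste0(label, "(\\d+)吨"))
  as.numeric(m[, 2])  # column 1 is the full match, column 2 the first group
}

crush <- extract_tons(s, "总量")
meal  <- extract_tons(s, "出粕")
oil   <- extract_tons(s, "出油")
```

Because the label is part of the match rather than a lookbehind, this does not depend on how ICU computes lookbehind widths.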
Upvotes: 2
Reputation: 4620
Try this:
oil <- str_extract(string = text_, pattern = perl("(?<=出油).*(?=吨),较上周的))"))
A bare 吨 may appear again later in your text, so the greedy .* cannot tell which occurrence you mean. The match may then run past the number you want, or cause a data type issue when you convert the result.
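To illustrate the greedy-match problem, here is a small sketch on a shortened sample of the question's text. It uses a capture group instead of a lookbehind, and the variable names are only for illustration.

```r
library(stringr)

# Shortened sample of the text from the question; 吨 occurs three times.
s <- "全国各地油厂大豆压榨总量1456000吨(出粕1157520吨,出油262080吨),较上周的"

# Greedy: .* extends to the LAST 吨 in the string.
greedy <- str_match(s, "总量(.*)吨")[, 2]

# Digits only: \\d+ cannot cross a 吨, so the match stops at the number.
digits <- str_match(s, "总量(\\d+)吨")[, 2]

print(greedy)
print(digits)
```

Restricting the middle of the pattern to \\d+ (or using a lazy .*?) is usually simpler than adding more trailing context to the lookahead.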
Upvotes: 0