Papa Analytica
Reputation: 187

Limiting the scope of lookaround

I'm using the regex engine in R and I want a lookaround that checks for a specific word within a limited range of 3-8 words. How can I do it?

If you need more detail: I'm trying to extract the degree (Mild/Moderate/Severe) of a specific type of heart dysfunction (systolic dysfunction) from a large number of echo reports. Each heart has two ventricles, and I want to extract the systolic dysfunction of the left ventricle (lv), not the right ventricle (rv).

So yes to: "enlarged lv chamber with some degree of mild to moderate systolic dysfunction" and no to: "enlarged rv chamber with some degree of mild to moderate systolic dysfunction"

Echo reports discuss both rv and lv dysfunction, so I naturally want to use lookarounds to exclude cases where "rv" appears within 3-8 words of, for example, "mild systolic dysfunction".

I tried a lookbehind like this:

(?<!rv(\\s+\\w+\\s+){3,8})

but I get the following error:

"Look-Behind pattern matches must have a bounded maximum length"

P.S.: I'm using stringr.

The code I used is like this:

lv_systolic_dysfunction <- "(?i)(?<!rv(\\s+\\w+\\s+){3,8})\\b(?!lv\\b)((?:\\w+\\s+to\\s+)?\\w+)\\b(?=(?:\\s+lv)?\\s+s[yi]stolic\\s+d[yi]sfunction)"

Upvotes: 1

Views: 64

Answers (1)

Wiktor Stribiżew

Reputation: 626946

You need to make sure the lookbehind pattern has a "bounded maximum length" by using only limiting quantifiers ({min,max}) inside it. The + quantifier matches one or more occurrences: it restricts the lower bound (to 1) but not the upper bound, so a lookbehind that contains it has no maximum length.
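To see the difference between the two kinds of quantifier, here is a minimal sketch. The helper name compiles() and the two short patterns are made up for illustration, and it assumes stringr reports an invalid pattern as an ordinary R error:

```r
library(stringr)

# Hypothetical helper: TRUE if the regex engine accepts the pattern,
# FALSE if using it raises an error.
compiles <- function(pattern) {
  tryCatch({
    str_detect("rv mild", pattern)
    TRUE
  }, error = function(e) FALSE)
}

# "+" has no upper bound, so the lookbehind has no maximum length
unbounded <- "(?<!rv\\s+)mild"
# "{1,5}" caps the repetition, so the lookbehind length is bounded
bounded <- "(?<!rv\\s{1,5})mild"

compiles(unbounded)
compiles(bounded)
```

The first call should report that the pattern is rejected and the second that it is accepted. The cap of 5 is arbitrary.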

See a sample R demo:

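library(dplyr)
library(stringr)
# One sample report sentence that mentions "rv" before the severity phrase
df <- tibble(test = c("normal rv with mild to moderate systolic dysfunction"))
# Negative lookbehind built only from bounded quantifiers: it rejects a match when "rv" plus 3 to 5 words (and trailing whitespace) ends right before "mild to moderate"
lv_systolic_dysfunction <- "(?<!\\brv(?:\\s{1,100}\\w{1,100}){3,5}\\s{1,100})\\bmild to moderate\\b"
# Display each string with any matches highlighted
str_view_all(df$test, lv_systolic_dysfunction)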

In this regex, \s{1,100} matches 1 to 100 whitespace characters and \w{1,100} matches 1 to 100 word characters. This is what "bounded" means: capped from below and from above. The numbers are arbitrary, so use common sense and your data when choosing them. Normal text is unlikely to have more than 2 spaces between words (I set 100 here just as an extreme example), and 100 characters is more than enough for a single word. Adjust as you see fit.
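As a rough sketch of how this could be applied to the two sentences from the question: the names reports and lv_only, the caps of 5 and 30, and the {0,8} word range are my assumptions for illustration, not values from the demo above. Using {0,8} means "rv" is rejected at any distance from zero up to eight words before the phrase.

```r
library(stringr)

# The two example sentences from the question
reports <- c(
  "enlarged lv chamber with some degree of mild to moderate systolic dysfunction",
  "enlarged rv chamber with some degree of mild to moderate systolic dysfunction"
)

# Assumed bounds: up to 5 whitespace characters between words, up to 30
# characters per word, and 0 to 8 words between "rv" and the phrase.
lv_only <- "(?i)(?<!\\brv(?:\\s{1,5}\\w{1,30}){0,8}\\s{1,5})\\bmild to moderate\\b"

# Which reports contain the phrase without "rv" shortly before it?
str_detect(reports, lv_only)

# Pull out the phrase where it is allowed; NA where the lookbehind rejects it
str_extract(reports, lv_only)
```

Only the lv sentence should be kept. In the rv sentence, "rv" sits five words before the phrase, which falls inside the assumed range, so the lookbehind rejects it.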


This constrained-width lookbehind is a feature of the ICU regex flavor, which the stringr regex functions in R use.

Upvotes: 2
