marcusstarnes

Reputation: 6531

Regex word boundaries and distance between matches

I would like to be able to use a regular expression to find any matches for a particular keyphrase within some text.

The keyphrase may or may not contain 1 or more spaces (it would usually only be 1 word, but in some cases may be multiple words).

I am currently using the following expression where the keyphrase is a single word (containing no spaces):

var regexPattern = string.Format( "\\b({0})\\b", keyphrase );

When the keyphrase is multiple words (contains one or more spaces), I am then updating the expression to replace any of those spaces with a wildcard:

regexPattern = regexPattern.Replace( " ", ".*" );

There are a couple of scenarios where this is not behaving as I need it to.

1) If the keyphrase within the long text that I'm searching is immediately preceded or followed by an underscore or a digit, it no longer matches. It's fine with hyphens, commas, full stops etc.; in those cases the keyphrase is still detected. I also need it to match when the keyphrase is surrounded by underscores or digits.

2) In the scenario where my keyphrase consists of multiple words (contains 1 or more spaces), I would like to allow up to a certain maximum distance/length between each of the words that form my keyphrase.

e.g. If my keyphrase is:

for sale

... and the text that I am matching against is

I have a bike for    sale.

... (where there is up to a maximum distance of 5 characters between the keyphrase words), I would like the regex to match:

bike for    sale

However, if there was more distance between the keyphrase words than 5 characters, I would not want it to match.

Also, this 'distance' shouldn't be confined to the number of spaces that occur between the keyphrase words, as I would also like the following to match for example:

I have a bike for _.,1sale.

Finally, it's probably worth stating that in some cases, the keyphrase I'm searching for may appear more than once, and where the above conditions are met, I'd need both to be matched:

e.g.

I have a bike for _.,1sale. I've also got a laptop for sale!

So, I essentially have 2 additional requirements on what I currently have, but don't know regular expressions well enough to know how I can implement these.
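
Both problems can be reproduced with a short sketch. The pattern is built exactly as described above; the class name BoundaryRepro, the helper BuildOriginal and the sample strings are only assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryRepro
{
    // Builds the pattern the same way as in the question:
    // \b around the keyphrase, spaces replaced with ".*".
    public static string BuildOriginal(string keyphrase)
    {
        var pattern = string.Format("\\b({0})\\b", keyphrase);
        return pattern.Replace(" ", ".*");
    }

    public static void Main()
    {
        var single = BuildOriginal("sale");

        // Punctuation next to the word still gives a word boundary.
        Console.WriteLine(Regex.IsMatch("bike for sale.", single));  // True

        // '_' and digits are word characters, so \b finds no boundary
        // between them and a letter.
        Console.WriteLine(Regex.IsMatch("bike for _sale_", single)); // False
        Console.WriteLine(Regex.IsMatch("bike for 1sale", single));  // False

        // ".*" puts no limit on the distance between the two words.
        var multi = BuildOriginal("for sale");
        var text = "for a long time nothing was on sale";
        Console.WriteLine(Regex.Match(text, multi).Value == text);   // True
    }
}
```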

Upvotes: 4

Views: 1975

Answers (1)

Wiktor Stribiżew

Reputation: 627390

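I think you can use the following code to address both issues. \b fails next to an underscore or a digit because both count as word characters, so the lookarounds below only check that the keyphrase is not adjacent to a letter (\p{L}), and the bounded quantifier .{0,5} limits the distance between the words: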

var regexPattern = string.Format( "(?<!\\p{{L}}){0}(?!\\p{{L}})", keyphrase );
// or
// var regexPattern = string.Format( "(?<=\\P{{L}}|^){0}(?=\\P{{L}}|$)", keyphrase );
regexPattern = regexPattern.Replace( " ", ".{0,5}" );

The regex will look like

(?<!\p{L})key.{0,5}word(?!\p{L})

or

(?<=\P{L}|^)key.{0,5}word(?=\P{L}|$)

Here is demo 1 / demo 2
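
As a rough end-to-end sketch, the first pattern can be run against the sample text from the question. The class KeyphraseMatcher, the helper Build and the maxGap parameter are hypothetical names, not part of the code above. One deliberate difference: each word is passed through Regex.Escape, on the assumption that keyphrases may contain regex metacharacters. Because Regex.Escape also escapes spaces, the keyphrase is split on spaces first instead of calling Replace afterwards.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyphraseMatcher
{
    // Hypothetical helper: escapes each word of the keyphrase and joins
    // the words with a gap of at most maxGap arbitrary characters.
    public static Regex Build(string keyphrase, int maxGap = 5)
    {
        var words = keyphrase
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => Regex.Escape(w));
        var gap = ".{0," + maxGap + "}";
        var body = string.Join(gap, words);

        // Only letters on either side block a match; digits, underscores
        // and punctuation are allowed next to the keyphrase.
        return new Regex(@"(?<!\p{L})" + body + @"(?!\p{L})");
    }

    public static void Main()
    {
        var regex = Build("for sale");
        var text = "I have a bike for _.,1sale. I've also got a laptop for sale!";

        // Prints "for _.,1sale" and then "for sale".
        foreach (Match m in regex.Matches(text))
            Console.WriteLine(m.Value);

        // Seven characters between the words exceed the limit of five.
        Console.WriteLine(regex.IsMatch("bike for ______sale")); // False
    }
}
```

Regex.Matches returns every non-overlapping occurrence, which covers the case where the keyphrase appears more than once in the text.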

Note that if you also want the inner word boundaries (between the words of the keyphrase) to be checked the same way, use

regexPattern = regexPattern.Replace( " ", "(?=\\P{L}).{0,5}(?<=\\P{L})" );

Regex will be

(?<!\p{L})key(?=\P{L}).{0,5}(?<=\P{L})word(?!\p{L})

or

(?<=\P{L}|^)key(?=\P{L}).{0,5}(?<=\P{L})word(?=\P{L}|$)

See demo: this variant excludes the cases where the 2 words are glued together, or where a letter directly touches the inner edge of either word.
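
The effect of the inner lookarounds is easiest to see on a few inputs. This is a minimal sketch under the same assumptions as the patterns above: StrictKeyphraseMatcher, Build and maxGap are hypothetical names, and splitting on spaces plus Regex.Escape is an addition meant to keep metacharacters in the keyphrase literal.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StrictKeyphraseMatcher
{
    // Hypothetical helper: like the simple pattern, but the gap between
    // words must start and end with a non-letter character.
    public static Regex Build(string keyphrase, int maxGap = 5)
    {
        var words = keyphrase
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => Regex.Escape(w));
        var gap = @"(?=\P{L}).{0," + maxGap + @"}(?<=\P{L})";
        return new Regex(@"(?<!\p{L})" + string.Join(gap, words) + @"(?!\p{L})");
    }

    public static void Main()
    {
        var strict = Build("for sale");

        Console.WriteLine(strict.IsMatch("a bike for    sale"));   // True
        Console.WriteLine(strict.IsMatch("a bike for _.,1sale"));  // True

        // Words glued together: the gap is empty, so the lookahead after
        // "for" sees the letter 's' and fails.
        Console.WriteLine(strict.IsMatch("forsale"));              // False

        // Letters touching the inner edge of either word are rejected too.
        Console.WriteLine(strict.IsMatch("forty sale"));           // False
        Console.WriteLine(strict.IsMatch("for resale"));           // False
    }
}
```

Without the inner lookarounds, all five inputs would match, because .{0,5} happily consumes the extra letters.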

Upvotes: 2
