Reputation: 6531
I would like to be able to use a regular expression to find any matches for a particular keyphrase within some text.
The keyphrase may or may not contain 1 or more spaces (it would usually only be 1 word, but in some cases may be multiple words).
I am currently using the following expression where the keyphrase is a single word (containing no spaces):
var regexPattern = string.Format( "\\b({0})\\b", keyphrase );
When the keyphrase is multiple words (contains one or more spaces), I am then updating the expression to replace any of those spaces with a wildcard:
regexPattern = regexPattern.Replace( " ", ".*" );
There are a couple of scenarios where this is not behaving as I need it to.
1) If the keyphrase in the text I'm searching is immediately preceded or followed by an underscore or a digit, it no longer matches. It works with hyphens, commas, full stops etc., but I also need it to match when the keyphrase is surrounded by underscores or digits.
2) In the scenario where my keyphrase consists of multiple words (contains 1 or more spaces), I would like to allow up to a certain maximum distance/length between each of the words that form my keyphrase.
e.g. If my keyphrase is:
for sale
... and the text that I am matching against is
I have a bike for sale.
... (where there is up to a maximum distance of 5 characters between the keyphrase words), I would like the regex to match:
bike for sale
However, if there was more distance between the keyphrase words than 5 characters, I would not want it to match.
Also, this 'distance' shouldn't be confined to the number of spaces that occur between the keyphrase words, as I would also like the following to match for example:
I have a bike for _.,1sale.
Finally, it's probably worth stating that in some cases, the keyphrase I'm searching for may appear more than once, and where the above conditions are met, I'd need both to be matched:
e.g.
I have a bike for _.,1sale. I've also got a laptop for sale!
So, I essentially have 2 additional requirements on what I currently have, but don't know regular expressions well enough to know how I can implement these.
Upvotes: 4
Views: 1975
Reputation: 627390
I think the following code addresses both issues:
var regexPattern = string.Format( "(?<!\\p{{L}}){0}(?!\\p{{L}})", keyphrase );
// or
// var regexPattern = string.Format( "(?<=\\P{{L}}|^){0}(?=\\P{{L}}|$)", keyphrase );
regexPattern = regexPattern.Replace( " ", ".{0,5}" );
The regex will look like
(?<!\p{L})key.{0,5}word(?!\p{L})
or
(?<=\P{L}|^)key.{0,5}word(?=\P{L}|$)
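For context, `\b` fails next to underscores and digits because .NET counts both as word characters, whereas the `\p{L}` lookarounds only refuse to match next to letters. Below is a minimal sketch of the first variant applied to the sample text from the question. The names `KeyphraseMatcher`, `BuildPattern` and `FindAll` are hypothetical, and escaping each word with `Regex.Escape` is an addition not present in the snippet above, assumed here so that keyphrases containing regex metacharacters are treated literally.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyphraseMatcher
{
    // Hypothetical helper: builds the letter-boundary pattern for a keyphrase,
    // allowing up to maxGap arbitrary characters wherever the keyphrase has spaces.
    public static string BuildPattern(string keyphrase, int maxGap)
    {
        // Escape each word separately. Regex.Escape also escapes spaces, so
        // escaping the whole phrase first would break a later Replace(" ", ...).
        // Note: runs of several spaces collapse into a single gap here.
        var words = keyphrase
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => Regex.Escape(w));
        var gap = ".{0," + maxGap + "}";
        return @"(?<!\p{L})" + string.Join(gap, words) + @"(?!\p{L})";
    }

    // Returns the text of every match, in order of appearance.
    public static string[] FindAll(string text, string keyphrase, int maxGap)
    {
        return Regex.Matches(text, BuildPattern(keyphrase, maxGap))
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }
}
```

Under these assumptions, `KeyphraseMatcher.FindAll("I have a bike for _.,1sale. I've also got a laptop for sale!", "for sale", 5)` should return the two matches `for _.,1sale` and `for sale`, and the keyphrase `sale` should be found inside `_sale1` but not inside `wholesale`.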
Note that if you also want the inner word boundaries checked the same way, use
regexPattern = regexPattern.Replace( " ", "(?=\\P{L}).{0,5}(?<=\\P{L})" );
The regex will then be
(?<!\p{L})key(?=\P{L}).{0,5}(?<=\P{L})word(?!\p{L})
or
(?<=\P{L}|^)key(?=\P{L}).{0,5}(?<=\P{L})word(?=\P{L}|$)
See demo. With the inner lookarounds, the pattern no longer matches when the two words are glued together (e.g. keyword) or when extra letters are attached to either word.
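To make the difference concrete, here is a small sketch comparing the plain gap with the gap that also checks the inner boundaries. The class and member names (`GapPatterns`, `Loose`, `Strict`) are hypothetical. The two patterns are the first variants shown above, with `key` and `word` standing in for the keyphrase words.

```csharp
using System.Text.RegularExpressions;

public static class GapPatterns
{
    // Plain gap: any 0 to 5 characters between the two words.
    public const string Loose = @"(?<!\p{L})key.{0,5}word(?!\p{L})";

    // Gap with inner boundaries: the character after "key" and the character
    // before "word" must both be non-letters, so the gap cannot be empty.
    public const string Strict = @"(?<!\p{L})key(?=\P{L}).{0,5}(?<=\P{L})word(?!\p{L})";

    public static bool MatchesLoose(string text) => Regex.IsMatch(text, Loose);
    public static bool MatchesStrict(string text) => Regex.IsMatch(text, Strict);
}
```

With these patterns, `key word` matches both variants, while `keyword`, `keys word` and `key sword` match only the loose one, because in each case a letter sits directly after `key` or directly before `word`.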
Upvotes: 2