Charlie Morton

Reputation: 787

Python regular expressions 0 or more words from set

I have a big block of text within which I am trying to look for a phrase. The phrase can be structured in a number of different ways.

  1. First I want to look for a word from a set of words, let's call it set 1.
  2. After that there must be a space or comma (or maybe something else that separates words).
  3. Then there may be 0 or more words from set 2.
  4. These are again followed by the word separation characters from point 2 above.
  5. Finally there should be a word from set 3.

Ideally all of these should be in the same sentence.

set 1 = (Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)

set 2 = (for|to|of|full|a|be|complete|Internal)

set 3 = (renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

So I have this regular expression:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(for|to|of|full|a|be|complete|Internal)[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

Now this will match a phrase where there are 0 or 1 words from set 2, but not if there are several, e.g. "provides a wonderful opportunity for someone to add their own stamp as the property needs complete renovation throughout."

As soon as I add 'a' before 'complete', it fails. The same happens if I add another 'complete'.
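A minimal way to reproduce this in Python, assuming the pattern is run with `re.search` on a shortened version of the example sentence (the variable names are only for illustration):

```python
import re

# The pattern from the question, copied unchanged.
PATTERN = re.compile(
    r"(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)"
    r"[ ,]*(for|to|of|full|a|be|complete|Internal)"
    r"[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving"
    r"|modernising|modernizing|updating|potential|project|renovation)"
)

one_word = "as the property needs complete renovation throughout."
two_words = "as the property needs a complete renovation throughout."

m = PATTERN.search(one_word)
print(m.group(0) if m else None)   # needs complete renovation
print(PATTERN.search(two_words))   # None
```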

How do I specify to look for 0 or multiple words from a set?

Upvotes: 2

Views: 155

Answers (4)

francesco bergesio

Reputation: 11

You have to use this regex:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,](for|to|of|full|a|be|complete|Internal)*[ ,](renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

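This reads as: one word from set 1, then one space or comma, then zero or more words from set 2, then another space or comma, and finally one word from set 3. Note that the separator sits outside the repeated group, so consecutive set 2 words are only matched when they run together with nothing between them.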

Upvotes: 1

mrzasa

Reputation: 23327

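Long alternations in regular expressions can be quite slow, so I'd suggest another approach. First segment the text (split it into words), then iterate over the list of words, checking whether each run of three consecutive words fulfils your requirements.

Something like this (a rough sketch: set1, set2 and set3 are assumed to be Python sets of the allowed words, and it only checks runs of exactly three words):

import re
def check(text, set1, set2, set3):
  words = re.findall(r"\w+", text)  # segment the text into words
  for i in range(len(words) - 2):
      # three consecutive words must come from set 1, set 2 and set 3 in order
      if words[i] in set1 and words[i+1] in set2 and words[i+2] in set3:
          return True
  return False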
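The word-by-word idea can be stretched to cover "0 or more words from set 2" by scanning forward from each set 1 word. This is only a sketch under some assumptions: `find_phrase` is a hypothetical helper name, matching is made case-insensitive by lower-casing (the question's pattern is case-sensitive), and punctuation is thrown away, so sentence boundaries are not respected.

```python
import re

SET1 = {"potential", "ability", "possibility", "need", "requires", "needs",
        "plenty", "for", "needing", "requiring"}
SET2 = {"for", "to", "of", "full", "a", "be", "complete", "internal"}
SET3 = {"renovate", "improve", "modernise", "modernize", "update", "renovating",
        "improving", "modernising", "modernizing", "updating", "potential",
        "project", "renovation"}

def find_phrase(text):
    """Return the first run 'set 1 word, any set 2 words, set 3 word', or None."""
    # Assumption: lower-case everything and keep only alphabetic runs as words.
    words = re.findall(r"[a-z]+", text.lower())
    for i, word in enumerate(words):
        if word not in SET1:
            continue
        j = i + 1
        while j < len(words):
            if words[j] in SET3:       # phrase is complete
                return words[i:j + 1]
            if words[j] not in SET2:   # an unrelated word breaks the run
                break
            j += 1                     # skip one more set 2 word
    return None

print(find_phrase("the property needs a complete renovation throughout."))
# ['needs', 'a', 'complete', 'renovation']
print(find_phrase("the property needs urgent renovation"))
# None
```

Set membership tests are constant time on average, so the cost grows with the number of words rather than with the size of the word lists.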

Upvotes: 2

Jim Wright

Reputation: 6058

Set 1: Matches any of the words in set 1 with 1 separator.

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]

Set 2: Matches any of the words in set 2 with 1 separator, 0 or more times.

((for|to|of|full|a|be|complete|Internal)[ ,])*

Set 3: Matches any of the words in set 3

(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

Full:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]((for|to|of|full|a|be|complete|Internal)[ ,])*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
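For what it's worth, here is a small sketch of the full pattern used from Python with `re.search`. The test strings are made up for illustration. Because each separator is exactly one character, a comma followed by a space between words will not match; the `[ ,]+` variant at the end is an assumption about what you may want, not part of the pattern above.

```python
import re

SET1 = r"(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)"
SET2 = r"(for|to|of|full|a|be|complete|Internal)"
SET3 = (r"(renovate|improve|modernise|modernize|update|renovating|improving"
        r"|modernising|modernizing|updating|potential|project|renovation)")

# Same structure as the full pattern: set 1, separator, (set 2 + separator)*, set 3.
FULL = re.compile(SET1 + r"[ ,]" + r"(" + SET2 + r"[ ,])*" + SET3)

print(FULL.search("the property needs a complete renovation").group(0))
# needs a complete renovation
print(FULL.search("the property needs renovation").group(0))
# needs renovation
print(FULL.search("needs a complete, full renovation"))
# None, because ", " is two separator characters

# Assumed variant: allow one or more separator characters between words.
LOOSE = re.compile(SET1 + r"[ ,]+" + r"(" + SET2 + r"[ ,]+)*" + SET3)
print(LOOSE.search("needs a complete, full renovation").group(0))
# needs a complete, full renovation
```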

Upvotes: 3

dquijada

Reputation: 1699

Just in case you didn't know, you can use sites like https://regex101.com/ to test your regular expressions and see why they do or do not match.

In this case, you need the "zero or more" (*) operator on your second group. The result would be:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(for|to|of|full|a|be|complete|Internal)*[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

However, since you probably want the words to be separated, include the separator inside the repeated part (you can use a non-capturing group for that), resulting in:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(?:(for|to|of|full|a|be|complete|Internal)[ ,]*)*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
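If the word lists are likely to change, the same shape can be assembled from Python lists instead of being typed by hand. The sketch below rests on a few assumptions of mine that go beyond the pattern above: `alternation` is a hypothetical helper, `\b` word boundaries are added so that "for" inside "before" is not picked up, `re.IGNORECASE` is switched on, and the separator is `[ ,]+` rather than `[ ,]*`.

```python
import re

SET1 = ["Potential", "Ability", "Possibility", "need", "requires", "needs",
        "plenty", "for", "Needing", "Requiring"]
SET2 = ["for", "to", "of", "full", "a", "be", "complete", "Internal"]
SET3 = ["renovate", "improve", "modernise", "modernize", "update", "renovating",
        "improving", "modernising", "modernizing", "updating", "potential",
        "project", "renovation"]

def alternation(words):
    # Longest words first, so "needs" is tried before "need"; escape each word.
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))

SEP = r"[ ,]+"  # one or more spaces or commas between words

PATTERN = re.compile(
    rf"\b(?:{alternation(SET1)}){SEP}"       # one word from set 1
    rf"(?:(?:{alternation(SET2)}){SEP})*"    # zero or more words from set 2
    rf"(?:{alternation(SET3)})\b",           # one word from set 3
    re.IGNORECASE,
)

m = PATTERN.search("Property requiring a full, complete renovation.")
print(m.group(0))                             # requiring a full, complete renovation
print(PATTERN.search("before renovation"))    # None
```

Since only spaces and commas are accepted between the words, a full stop between them prevents a match, which keeps the whole phrase inside one sentence.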

Upvotes: 0
