NutellaAddict

Reputation: 584

Regex match only if multiple patterns found (python)

I'm trying to extract data from sentences such as:

"monthly payment of 525 and 5000 drive off"

using Python's regex search function, re.search().

My regex pattern for the down payment is as follows:

match1 = r"(?P<down_payment>\d+)\s*(|\$|dollars*|money)*\s*" + \
         r"(down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)*"

My problem is that it matches the wrong number as the down payment: it picks up both 525 and 5000.

How can I improve my regex string such that it only matches an element if another element is successfully matched as well?

In this case, for example, both 5000 and drive-off matched, so we can extract 5000 as down_payment. But 525 is not followed by any of the down payment keywords, so it should not be considered at all.


Upvotes: 3

Views: 315

Answers (1)

Wiktor Stribiżew

Reputation: 626738

The point is that you want to match a sequence of patterns. To make sure the trailing patterns are taken into account, they cannot all be optional. Note that \s*, (|\$|dollars*|money)*, \s*, and (down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)* can each match an empty string, so \d+ on its own is enough for the whole pattern to match.
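To see the effect, here is a small sketch that runs the pattern from the question against the sample sentence. The variable names (original, text, trailing_parts, values) are only illustrative. It checks that each trailing piece accepts the empty string and lists every number the pattern picks up.

```python
import re

# The pattern from the question, written with raw strings.
original = (
    r"(?P<down_payment>\d+)\s*(|\$|dollars*|money)*\s*"
    r"(down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)*"
)

text = "monthly payment of 525 and 5000 drive off"

# Every piece after \d+ can match the empty string.
trailing_parts = [
    r"\s*",
    r"(|\$|dollars*|money)*",
    r"(down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)*",
]
all_optional = all(re.fullmatch(p, "") is not None for p in trailing_parts)
print(all_optional)  # True

# So any run of digits matches, whether or not a keyword follows.
values = [m.group("down_payment") for m in re.finditer(original, text)]
print(values)  # ['525', '5000']
```

Since re.search() returns the leftmost match, it reports 525 for this sentence.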

I suggest removing the final * quantifier to match exactly one occurrence of the pattern:

(?P<down_payment>\d+)\s*(?:\$|dollars*|money)?\s*(down|drive[\s-]*off|due\s*at\s*signing|drive\s*-*\s*off)

See the regex demo

Also note that I contracted a (\s|-) group into a character class [\s-] as you only alternate single char patterns, and also turned (|\$|dollars*|money)* into a non-capturing optional group (?:\$|dollars*|money)? that matches just 1 or 0 occurrences of $, dollar(s) or money.
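A minimal sketch of how the pattern with a mandatory keyword group could be used from Python. The names DOWN_PAYMENT and extract_down_payment are hypothetical helpers introduced for illustration, and the extra sample sentences are made up.

```python
import re

# Pattern with a mandatory keyword group (no trailing * quantifier).
DOWN_PAYMENT = re.compile(
    r"(?P<down_payment>\d+)\s*(?:\$|dollars*|money)?\s*"
    r"(down|drive[\s-]*off|due\s*at\s*signing|drive\s*-*\s*off)"
)


def extract_down_payment(sentence):
    """Return the down payment as a string, or None if no keyword follows a number."""
    match = DOWN_PAYMENT.search(sentence)
    return match.group("down_payment") if match else None


print(extract_down_payment("monthly payment of 525 and 5000 drive off"))  # 5000
print(extract_down_payment("3000 dollars due at signing"))                # 3000
print(extract_down_payment("monthly payment of 525"))                     # None
```

Because the keyword group is now required, 525 is skipped: nothing after it matches down, drive off, or due at signing.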

Upvotes: 2
