Reputation: 584
I'm trying to extract data from sentences such as:
"monthly payment of 525 and 5000 drive off"
using Python's regex search function, re.search().
My regex pattern for the down payment is as follows:
match1 = r"(?P<down_payment>\d+)\s*(|\$|dollars*|money)*\s*" + \
r"(down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)*"
My problem is that it matches the wrong numerical value as the down payment: it picks up both 525 and 5000.
How can I improve my regex string such that it only matches an element if another element is successfully matched as well?
In this case, for example, both 5000 and drive-off matched, so we can extract 5000 as down_payment. But 525 is not followed by any of the down payment keywords, so it should not be considered at all.
Upvotes: 3
Views: 315
Reputation: 626738
The point is that you want to match a sequence of patterns. To make sure the trailing patterns are taken into account, they cannot all be optional. Note that \s*, (|\$|dollars*|money)*, \s*, and (down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)* can all match an empty string, so \d+ alone is enough for the whole pattern to succeed.
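To make the problem concrete, here is a minimal sketch that runs the pattern from the question against the sample sentence. The names ORIGINAL_PATTERN and first_match are hypothetical, chosen only for this illustration. Because everything after \d+ is optional, re.search() succeeds at the first run of digits it meets:

```python
import re

# Pattern from the question, written as raw strings.
# Every part after \d+ is optional, so a bare number is a full match.
ORIGINAL_PATTERN = (
    r"(?P<down_payment>\d+)\s*(|\$|dollars*|money)*\s*"
    r"(down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)*"
)

sentence = "monthly payment of 525 and 5000 drive off"

# re.search() returns the leftmost match, which starts at "525".
first_match = re.search(ORIGINAL_PATTERN, sentence)
print(first_match.group("down_payment"))  # prints 525
```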
I suggest removing the final * quantifier so that exactly one occurrence of the keyword pattern must match:
(?P<down_payment>\d+)\s*(?:\$|dollars*|money)?\s*(down|drive[\s-]*off|due\s*at\s*signing|drive\s*-*\s*off)
Also note that I contracted the (\s|-) group into a character class, [\s-], since it only alternates single-character patterns, and turned (|\$|dollars*|money)* into a non-capturing optional group, (?:\$|dollars*|money)?, that matches 1 or 0 occurrences of $, dollar(s), or money.
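As a rough sketch of how the revised pattern could be used from Python, the snippet below wraps it in a small helper. The helper name find_down_payment and the constant DOWN_PAYMENT_RE are assumptions made for this example, not part of the question's code. With the keyword group now mandatory, a number that is not followed by a down payment keyword is skipped:

```python
import re

# Revised pattern: the trailing keyword group is required (no * after it).
DOWN_PAYMENT_RE = re.compile(
    r"(?P<down_payment>\d+)\s*(?:\$|dollars*|money)?\s*"
    r"(down|drive[\s-]*off|due\s*at\s*signing|drive\s*-*\s*off)"
)

def find_down_payment(text):
    """Return the down payment as an int, or None if no keyword follows a number."""
    match = DOWN_PAYMENT_RE.search(text)
    return int(match.group("down_payment")) if match else None

print(find_down_payment("monthly payment of 525 and 5000 drive off"))  # prints 5000
print(find_down_payment("monthly payment of 525"))                     # prints None
```

Here 525 is rejected because the text after it ("and ...") does not start with any of the keywords, so the search moves on to 5000.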
Upvotes: 2