Argo

Reputation: 103

Match prices using regular expression, but with exceptions

I have a string:

foo bar $ 123.456 bar foo $ 652 $ 1.255.250 bar $ 2.000 foo badword $ 300.000 foo bar $ 123 badword2 $ 400

I want to match all the prices except the ones that follow a "badword".

Match:

123.456
652
1.255.250
2.000
123

Do not match:

badword $ 300.000
badword2 $ 400

I'm developing in Python 3.6 and using (\d+).(\d+) to capture the prices so far.

Upvotes: 0

Views: 61

Answers (2)

The fourth bird

Reputation: 163277

The pattern (\d+).(\d+) captures one or more digits in group 1 and one or more digits in group 2, but the unescaped dot between them matches any character (except a newline), so it would also match 123a456. To match a literal dot, escape it as \.
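
A small sketch of that difference may help. The variable names here are made up for illustration, and the stricter pattern is the \d+(?:\.\d+)* fragment used further down in this answer:

```python
import re

loose = re.compile(r"(\d+).(\d+)")     # the pattern from the question
strict = re.compile(r"\d+(?:\.\d+)*")  # digits, then optional ".digits" groups

# The unescaped dot accepts any character between the two digit runs.
letter_match = loose.search("123a456")
print(letter_match.group(0))  # 123a456

# A price without a dot gets split, because the dot must consume one character.
split_match = loose.search("652")
print(split_match.groups())  # ('6', '2')

# With the dot escaped and the fractional groups optional, only real prices pass.
print(strict.fullmatch("123a456"))             # None
print(strict.fullmatch("652").group(0))        # 652
print(strict.fullmatch("1.255.250").group(0))  # 1.255.250
```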

One option is to use an alternation: first match what you do not want, (?:badword|badword2) \$ \d+(?:\.\d+)*, without capturing it, and then capture what you do want, \$ (\d+(?:\.\d+)*), in a group:

(?:badword|badword2) \$ \d+(?:\.\d+)*|\$ (\d+(?:\.\d+)*)

Pattern breakdown:

  • (?: Non capturing group
    • badword|badword2 Match bad words
  • ) Close non capturing group
  • \$ (with a space on each side) Match a space, a literal $, and a space
  • \d+(?:\.\d+)* Match 1 or more digits followed by (a dot and 1 or more digits) repeated 0 or more times
  • | Or
  • \$ (with a space on each side) Match a space, a literal $, and a space
  • ( Capturing group (Your digits will be in here)
    • \d+(?:\.\d+)* Match 1 or more digits followed by (a dot and 1 or more digits) repeated 0 or more times
  • ) Close capturing group

You can extend the alternation with the badwords you want to add.
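
In Python, this could look roughly like the sketch below. The bad_words list and the variable names are assumptions for illustration; the pattern is the one shown above, with the alternation built from the list. Matches that come from the "badword" branch leave group 1 unset, so they are filtered out:

```python
import re

s = ("foo bar $ 123.456 bar foo $ 652 $ 1.255.250 bar $ 2.000 "
     "foo badword $ 300.000 foo bar $ 123 badword2 $ 400")

# Hypothetical list of words whose prices should be skipped.
bad_words = ["badword", "badword2"]

# Build the (?:badword|badword2) part; re.escape guards against regex metacharacters.
alternatives = "|".join(re.escape(w) for w in bad_words)
pattern = re.compile(rf"(?:{alternatives}) \$ \d+(?:\.\d+)*|\$ (\d+(?:\.\d+)*)")

# Group 1 is None when the "badword" branch matched, so keep only real captures.
prices = [m.group(1) for m in pattern.finditer(s) if m.group(1)]
print(prices)  # ['123.456', '652', '1.255.250', '2.000', '123']
```

To skip more words, append them to bad_words; the pattern itself stays the same.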

Upvotes: 2

ctwheels

Reputation: 22817

Personally, I'd use a more Pythonic approach with a list comprehension. The regex captures two groups for each price: the text preceding the $ and the price itself. The comprehension then discards every match whose preceding text contains badword (which also covers badword2) and keeps only the price.

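import re

s = "foo bar $ 123.456 bar foo $ 652 $ 1.255.250 bar $ 2.000 foo badword $ 300.000 foo bar $ 123 badword2 $ 400"
# Group 1: the text before each "$"; group 2: the price (1-3 digits, then optional ".ddd" groups)
r = re.compile(r"([^$]+)\$\s*(\d{1,3}(?:\.\d{3})*)")
# Keep a price only when the text before its "$" does not contain "badword"
print([x[1] for x in r.findall(s) if "badword" not in x[0]])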

The regex used in the code above is:

([^$]+)\$\s*(\d{1,3}(?:\.\d{3})*)

The following regular expression may also be used:

([^$]+)\$\s*([\d.]+)
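
The two patterns differ in how strictly they validate the price. A rough comparison, using a made-up sample string that is not from the question:

```python
import re

strict = re.compile(r"([^$]+)\$\s*(\d{1,3}(?:\.\d{3})*)")
loose = re.compile(r"([^$]+)\$\s*([\d.]+)")

# Hypothetical input: one well-formed price and one with a single digit after the dot.
sample = "a $ 1.255.250 b $ 12.5"

# findall returns (preceding_text, price) tuples; keep only the price.
strict_prices = [price for _, price in strict.findall(sample)]
loose_prices = [price for _, price in loose.findall(sample)]

print(strict_prices)  # ['1.255.250', '12']
print(loose_prices)   # ['1.255.250', '12.5']
```

The first pattern only accepts dots that are followed by exactly three digits, so it stops at 12. The second accepts any run of digits and dots.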

Upvotes: 0
