jokol

Reputation: 383

Python regex: Getting all numbers besides some which are followed by specific terms

The goal is to extract all numbers from a text except those that are immediately preceded or followed by specific words/characters, and to ignore dates as well. What I am struggling with is the negative lookbehind.

For example: 4.5 $55 1,200 wordA 3 sometext 2 wordB sometext 4.3charA sometext charB21.6 sometext 11/10/22

In the sample, the numbers 3, 2, 4.3, 21.6 and the date 11/10/22 should be ignored.

My attempt: https://regex101.com/r/PQvtOl/1/

(\d*\b[\.,]?\d+)(?!\d*? (?:wordB))(?!\d*?(?:charA))((?!\b[charB/])(?!\d+))

Any help would be greatly appreciated!

Upvotes: 1

Views: 241

Answers (1)

Wiktor Stribiżew

Reputation: 627083

You can use

(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)|\b(?:charB|wordA)\s*\d*[.,]?\d+|(?<!\d[.,])(?<!\d)(\d*[.,]?\d+)(?!\s*(?:wordB|charA)|[.,]?\d)

Keep only the matches that are captured into capturing group #1. See the regex demo. Details:

  • (?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)| - a date-like string: no digit allowed immediately on the left, then one or two digits, /, one or two digits, /, and then two or four digits with no extra digit on the right allowed, or
  • \b(?:charB|wordA)\s*\d*[.,]?\d+ - a word boundary, then charB or wordA, zero or more whitespaces, zero or more digits, an optional dot or comma, one or more digits
  • | - or (the next part is captured, and re.findall will only output those in the resulting list, the above ones will be discarded)
  • (?<!\d[.,])(?<!\d)(\d*[.,]?\d+)(?!\s*(?:wordB|charA)|[.,]?\d) - neither a digit nor a digit followed by . or , is allowed immediately on the left; then zero or more digits, an optional . or , and one or more digits are captured into Group 1; finally, the negative lookahead fails the match if, immediately on the right, there are zero or more whitespace characters followed by wordB or charA, or an optional . or , followed by a digit.
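The breakdown above can also be written directly into the pattern. The following is a minimal sketch, assuming the same pattern laid out with re.VERBOSE; the names NUMBER_RX and extract_numbers are hypothetical and not part of the original answer. It uses re.finditer so that discarded matches (Group 1 is None) are easy to tell apart from kept ones.

```python
import re

# Same three alternatives as in the answer, one per line.
# Only the last alternative contains a capturing group.
NUMBER_RX = re.compile(r"""
    (?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)   # date-like string: matched, then discarded
  | \b(?:charB|wordA)\s*\d*[.,]?\d+                # number after charB or wordA: discarded
  | (?<!\d[.,])(?<!\d)                             # not in the middle of a longer number
    (\d*[.,]?\d+)                                  # Group 1: the number to keep
    (?!\s*(?:wordB|charA)|[.,]?\d)                 # not followed by wordB/charA or more digits
""", re.VERBOSE)

def extract_numbers(text):
    """Return only the numbers captured in Group 1 (hypothetical helper)."""
    return [m.group(1) for m in NUMBER_RX.finditer(text) if m.group(1) is not None]

text = ('4.5 $55 1,200 wordA 3 sometext 2 wordB sometext '
        '4.3charA sometext charB21.6 sometext 11/10/22')
print(extract_numbers(text))
```

The first two alternatives exist only to consume the unwanted text, so that the third alternative never gets a chance to match a number inside it.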

See the Python demo:

import re
text = '4.5 $55 1,200 wordA 3 sometext 2 wordB sometext 4.3charA sometext charB21.6 sometext 11/10/22'
rx = r'(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)|\b(?:charB|wordA)\s*\d*[.,]?\d+|(?<!\d[.,])(?<!\d)(\d*[.,]?\d+)(?!\s*(?:wordB|charA)|[.,]?\d)'
matches = re.findall(rx, text)
print([m for m in matches if m])  # drop the empty strings produced by the discarded alternatives
# => ['4.5', '55', '1,200']
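If the words to exclude vary, the pattern could be assembled from lists instead of being hard-coded. This is only a sketch under stated assumptions: build_number_regex and its parameters preceding and following are hypothetical names, the date alternative is kept exactly as in the answer, and the words are passed through re.escape so that special characters in them are treated literally.

```python
import re

def build_number_regex(preceding, following):
    """Build the answer's pattern from word lists (hypothetical helper).

    preceding: words that disqualify a number appearing right after them
    following: words that disqualify a number appearing right before them
    """
    # Date-like strings are always matched and discarded.
    branches = [r"(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)"]
    if preceding:
        before = "|".join(map(re.escape, preceding))
        branches.append(rf"\b(?:{before})\s*\d*[.,]?\d+")
    # The lookahead always rejects a match that stops in the middle of a number.
    tail = r"[.,]?\d"
    if following:
        after = "|".join(map(re.escape, following))
        tail = rf"\s*(?:{after})|{tail}"
    # Only this last alternative has a capturing group.
    branches.append(rf"(?<!\d[.,])(?<!\d)(\d*[.,]?\d+)(?!{tail})")
    return re.compile("|".join(branches))

rx = build_number_regex(["charB", "wordA"], ["wordB", "charA"])
text = ('4.5 $55 1,200 wordA 3 sometext 2 wordB sometext '
        '4.3charA sometext charB21.6 sometext 11/10/22')
result = [m for m in rx.findall(text) if m]
print(result)
```

Empty lists are handled by leaving the corresponding alternative out, since an empty group such as (?:) would otherwise match everywhere and discard every number.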

Upvotes: 1
