Reputation: 383
The goal is to extract all the numbers from a text except those that are either followed or preceded by specific words/characters (dates should be ignored too). What I am struggling with is the negative lookbehind.
For example: 4.5 $55 1,200 wordA 3 sometext 2 wordB sometext 4.3charA sometext charB21.6 sometext 11/10/22
In this sample, the numbers 3, 2, 4.3 and 21.6 and the date 11/10/22 would be ignored.
My attempt: https://regex101.com/r/PQvtOl/1/
(\d*\b[\.,]?\d+)(?!\d*? (?:wordB))(?!\d*?(?:charA))((?!\b[charB/])(?!\d+))
Any help would be greatly appreciated!
Upvotes: 1
Views: 241
Reputation: 627083
You can use
(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)|\b(?:charB|wordA)\s*\d*[.,]?\d+|(?<!\d[.,])(?<!\d)(\d*[.,]?\d+)(?!\s*(?:wordB|charA)|[.,]?\d)
Keep only the matches that are captured into capturing group #1. See the regex demo. Details:
- (?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d) - a date-like string: no digit allowed immediately on the left, then one or two digits, /, one or two digits, /, and then two or four digits, with no extra digit allowed on the right
- | - or
- \b(?:charB|wordA)\s*\d*[.,]?\d+ - a word boundary, then charB or wordA, zero or more whitespaces, zero or more digits, an optional dot or comma, one or more digits
- | - or (only the next part is captured, so re.findall returns the captured text for it and an empty string for matches of the alternatives above, which are then discarded)
- (?<!\d[.,])(?<!\d)(\d*[.,]?\d+)(?!\s*(?:wordB|charA)|[.,]?\d) - neither a digit nor a digit followed by . or , is allowed immediately on the left; then zero or more digits, an optional . or , and one or more digits are captured into Group 1; then the negative lookahead fails the match if wordB or charA appears on the right after zero or more whitespaces, or if an optional . or , followed by a digit appears immediately on the right.
See the Python demo:
import re
text = '4.5 $55 1,200 wordA 3 sometext 2 wordB sometext 4.3charA sometext charB21.6 sometext 11/10/22'
# Alternatives 1 and 2 consume dates and numbers after charB/wordA; only alternative 3 captures
rx = r'(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)|\b(?:charB|wordA)\s*\d*[.,]?\d+|(?<!\d[.,])(?<!\d)(\d*[.,]?\d+)(?!\s*(?:wordB|charA)|[.,]?\d)'
matches = re.findall(rx, text)  # group 1 per match, '' when group 1 did not take part
print([m for m in matches if m])  # drop the empty strings
# => ['4.5', '55', '1,200']
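If the one-line pattern is hard to maintain, the same three alternatives can be laid out with re.VERBOSE and iterated with re.finditer. This is only a sketch of the same "match what you want to skip, capture what you want to keep" idea; the names NUMBER_RX and extract_numbers are made up for illustration and are not part of the original demo.

```python
import re

# Same pattern as above, split across lines with re.VERBOSE.
# NUMBER_RX and extract_numbers are hypothetical names for this sketch.
NUMBER_RX = re.compile(r"""
    (?<!\d)\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?(?!\d)  # date-like string: consumed, not captured
  | \b(?:charB|wordA)\s*\d*[.,]?\d+               # number after charB or wordA: consumed, not captured
  | (?<!\d[.,])(?<!\d)                            # do not start in the middle of a number
    (\d*[.,]?\d+)                                 # group 1: the number to keep
    (?!\s*(?:wordB|charA)|[.,]?\d)                # not before wordB or charA, not cut short
""", re.VERBOSE)


def extract_numbers(text):
    """Return the numbers captured in group 1, skipping the discarded alternatives."""
    # m.group(1) is None when the date or charB/wordA alternative matched.
    return [m.group(1) for m in NUMBER_RX.finditer(text) if m.group(1)]


sample = ('4.5 $55 1,200 wordA 3 sometext 2 wordB sometext '
          '4.3charA sometext charB21.6 sometext 11/10/22')
print(extract_numbers(sample))
# => ['4.5', '55', '1,200']
```

With finditer, each match object also exposes m.start(1) and m.end(1), which may be handy if the positions of the kept numbers are needed as well.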
Upvotes: 1