Reputation: 63
I have a paragraph/sentence from which I want to identify
but I don't want to identify
How can I achieve this?
The regex I tried is: r'(?:\s|^)(\d-?(\s)?){6,}(?=[?\s]|$)'
but it's not accurate.
I'm looking for these patterns inside a paragraph:
123 456 789
They may also end with a full stop (.), but the following patterns should be ignored:
$123654
Upvotes: 0
Views: 323
Reputation: 163457
You could match what you don't want, and capture in a group what you do want to keep.
With re.findall, only the group 1 values are returned, so each unwanted match shows up as an empty string.
Afterwards you can filter out those empty strings.
(?<!\S)(?:\$\s*\d+(?:\,\d+)?|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)
In parts
(?<!\S)  Assert a whitespace boundary on the left
(?:  Non-capture group
  \$\s*  Match a dollar sign and 0+ whitespace chars
  \d+(?:\,\d+)?  Match 1+ digits with an optional comma-and-digits part
  |  Or
  (  Capture group 1
    \d+  Match 1+ digits
    (?:[ -]\d+)+\.?  Repeat 1+ times a space or - followed by 1+ digits, then match an optional .
    |  Or
    \d{3,}  Match 3 or more digits (or use {6,} for 6 or more)
  )  Close group 1
)  Close non-capture group
(?!\S)  Assert a whitespace boundary on the right
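As a rough sketch, the same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE. The sample string and the names PATTERN, sample and matches are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so each part carries a comment.
# The space inside [ -] still counts: whitespace inside a character class
# is not ignored in verbose mode.
PATTERN = re.compile(r"""
    (?<!\S)                      # whitespace boundary on the left
    (?:
        \$\s*\d+(?:,\d+)?        # unwanted: a dollar amount, matched but not captured
      |
        (                        # group 1: the part to keep
            \d+(?:[ -]\d+)+\.?   # digit groups joined by a space or -, optional final .
          |
            \d{3,}               # or a plain run of 3 or more digits
        )
    )
    (?!\S)                       # whitespace boundary on the right
""", re.VERBOSE)

# Hypothetical input, chosen to hit both alternatives.
sample = "call 123 456 789. or 123-456 but not $123654"

# findall yields '' where the dollar alternative matched; drop those entries.
matches = [m for m in PATTERN.findall(sample) if m]
print(matches)  # ['123 456 789.', '123-456']
```

The dollar amount is still consumed by the regex, which is what stops its digits from being picked up by the second alternative.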
For example
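import re

# Dollar amounts are matched but not captured; group 1 holds the wanted numbers.
regex = r"(?<!\S)(?:\$\s*(?:\d+(?:\,\d+)?)|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)"

test_str = ("123456\n"
            "1234567890\n"
            "12345\n\n"
            "12,123\n"
            "etc...")  # remaining test lines are elided; the output below is for the full input

# findall returns '' when group 1 did not take part in a match; filter(None, ...) drops those.
print(list(filter(None, re.findall(regex, test_str))))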
Output
['123456', '1234567890', '12345', '1-2-3', '123-456-789', '123-456-789.', '123-456', '123 456', '123 456 789', '123 456 789.', '123 456 123 456 789', '123', '456', '123', '456', '789']
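If you would rather skip the filtering step, one possible variation is to use re.finditer and keep only the matches where group 1 took part. This is a sketch, not part of the original answer: find_numbers, min_digits and the sample text are hypothetical names and data, and min_digits simply swaps the {3,} quantifier for {6,} or any other minimum.

```python
import re

def find_numbers(text, min_digits=3):
    """Return only the group 1 matches, skipping dollar amounts.

    find_numbers and min_digits are hypothetical names used for this sketch.
    """
    pattern = re.compile(
        r"(?<!\S)(?:\$\s*\d+(?:,\d+)?|(\d+(?:[ -]\d+)+\.?|\d{%d,}))(?!\S)" % min_digits
    )
    # group(1) is None when the dollar alternative matched, so no empty strings appear.
    return [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]

# Hypothetical input covering a short number, a long number, a spaced number,
# a dollar amount and a comma-separated number.
text = "Ref 123 or 123456 or 123 456 789. costs $123654 and 12,123 is skipped"

print(find_numbers(text))                # ['123', '123456', '123 456 789.']
print(find_numbers(text, min_digits=6))  # ['123456', '123 456 789.']
```

Note that min_digits only affects plain digit runs; numbers separated by spaces or hyphens are matched by the other branch whatever their length.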
Upvotes: 1