Pokemon

Reputation: 63

How to identify a series of numbers inside a paragraph

I have a paragraph/sentence from which I want to identify

  1. any series of 6 or more digits
  2. any series of numbers with a "-" (dash)

but I don't want to identify

  1. any numbers preceded by a $ (dollar sign)
  2. any series of numbers containing a , (comma)

How can I achieve this?

The regex I tried is: r'(?:\s|^)(\d-?(\s)?){6,}(?=[?\s]|$)' but it's not accurate.

I'm looking for these patterns inside a paragraph

Upvotes: 0

Views: 323

Answers (1)

The fourth bird

Reputation: 163457

You could match what you don't want and capture in a group what you want to keep.

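With re.findall, the value of group 1 is returned for every match. When the unwanted alternative matches, group 1 does not take part, so that match is returned as an empty string.

Afterwards you can filter out those empty strings.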

(?<!\S)(?:\$\s*\d+(?:\,\d+)?|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)

In parts

  • (?<!\S) Assert a whitespace boundary on the left
  • (?: Non capture group
    • \$\s* Match a dollar sign, 0+ whitespace chars
    • \d+(?:\,\d+)? Match 1+ digits with an optional comma digits part
    • | Or
    • ( Capture group 1
      • \d+ Match 1+ digits
      • (?:[ -]\d+)+\.? Repeat 1+ times a space or - followed by 1+ digits, then match an optional .
      • | Or
      • \d{3,} Match 3 or more digits (or use {6,} for 6 or more)
    • ) Close group 1
  • ) Close non capture group
  • (?!\S) Assert a whitespace boundary on the right
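
The breakdown above can also be written into the pattern itself with re.VERBOSE, which ignores whitespace outside character classes and allows comments. This is only a sketch: it uses {6,} for the plain-digit branch, as the question asks, and the sample sentence is made up for illustration.

```python
import re

# Same structure as the pattern above, laid out with re.VERBOSE.
# Assumption: {6,} is used instead of {3,} to follow the question's "6 digits or more".
PATTERN = re.compile(r"""
    (?<!\S)                      # whitespace boundary on the left
    (?:
        \$\s*\d+(?:,\d+)?        # unwanted: dollar amount, optional comma part
      |
        (                        # group 1: what we want to keep
            \d+(?:[ -]\d+)+\.?   # digit runs joined by a space or dash, optional dot
          |
            \d{6,}               # or a plain run of 6 or more digits
        )
    )
    (?!\S)                       # whitespace boundary on the right
""", re.VERBOSE)

# Hypothetical sample text, not taken from the question.
text = "call 555-1234 or 1234567 but not $500 or 12,345 or 12345"

# findall returns group 1 for each match; drop the empty strings.
kept = [value for value in PATTERN.findall(text) if value]
print(kept)  # ['555-1234', '1234567']
```

Note that the space inside [ -] is kept even in verbose mode, because whitespace inside a character class is not ignored.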


For example

import re

regex = r"(?<!\S)(?:\$\s*(?:\d+(?:\,\d+)?)|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)"

test_str = ("123456\n"
    "1234567890\n"
    "12345\n\n"
    "12,123\n"
    # ... remaining test lines omitted here
    )

print(list(filter(None, re.findall(regex, test_str))))

Output (for the full test string, which is abbreviated above)

['123456', '1234567890', '12345', '1-2-3', '123-456-789', '123-456-789.', '123-456', '123 456', '123 456 789', '123 456 789.', '123 456 123 456 789', '123', '456', '123', '456', '789']
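
To see why the filter step is needed, here is a small sketch that prints the raw re.findall result next to the filtered one. The sample sentence is hypothetical; the pattern is the one from this answer, with 3 or more digits for the plain-run branch.

```python
import re

# Pattern from the answer: the unwanted dollar amount is matched but not captured.
regex = r"(?<!\S)(?:\$\s*\d+(?:,\d+)?|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)"

# Hypothetical sample sentence, not taken from the question.
text = "Paid $1,200 for order 123456 ref 12-34"

# Raw result: the "$1,200" match leaves group 1 empty, so it appears as ''.
raw = re.findall(regex, text)
print(raw)   # ['', '123456', '12-34']

# Keep only the non-empty group 1 values.
kept = [value for value in raw if value]
print(kept)  # ['123456', '12-34']
```

Because the boundaries are whitespace-based, a number directly followed by punctuation such as a comma is not matched by this pattern.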

Upvotes: 1
