Zexelon

Reputation: 494

Regex number removal from text

I am trying to clean up text for use in a machine learning application. The inputs are "semi-structured" specification documents, and I am trying to remove the section numbers that interfere with NLTK's sent_tokenize() function.

Here is a sample of the text I am working with:

and a Contract for the work and/or material is entered into with some other person for a
greater amount, the undersigned hereby agrees to forfeit all right and title to the
aforementioned deposit, and the same is forfeited to the Crown.
2.3.3

...

(b)

until thirty-five days after the time fixed for receiving this tender,

whichever first occurs.
2.4

AGREEMENT

Should this tender be accepted, the undersigned agrees to enter into written agreement with
the Minister of Transportation of the Province of Alberta for the faithful performance of the
works covered by this tender, in accordance with the said plans and specifications and
complete the said work on or before October 15, 2019.

I am trying to remove all the section markers (e.g. 2.3.3, 2.4, (b)), but not the numbers in dates.

Here is the regex I have so far: [0-9]*\.[0-9]|[0-9]\.

Unfortunately it also matches part of the date in the last paragraph (2019. turns into 201), and as a non-expert at regex I really don't know how to fix this.
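For illustration, the behaviour described above can be reproduced with a short sketch. The variable names are made up for this example; the pattern and the date sentence are taken from the question.

```python
import re

# The pattern from the question: it is not anchored, so it can match anywhere in a line.
question_pattern = r"[0-9]*\.[0-9]|[0-9]\."

date_line = "complete the said work on or before October 15, 2019."
section_line = "2.3.3"

# The second alternative, [0-9]\., matches the "9." at the end of the sentence.
damaged_date = re.sub(question_pattern, "", date_line)
# The section number is removed in two steps: "2.3" and then ".3".
removed_section = re.sub(question_pattern, "", section_line)

print(repr(damaged_date))
print(repr(removed_section))
```

The section number is removed as intended, but the final digit and full stop of the year are lost as well, because nothing ties the match to the start of a line.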

Thanks for any help!

Upvotes: 0

Views: 169

Answers (3)

The fourth bird

Reputation: 163642

The pattern you tried, [0-9]*\.[0-9]|[0-9]\., is not anchored. It matches either 0+ digits, a dot and a single digit, or (after the |) a single digit followed by a dot, anywhere in the text.

It also does not account for the markers between parentheses, such as (b).

Assuming that the section breaks are at the start of a line, possibly preceded by spaces or tabs, you could anchor the pattern and add an alternation for the parenthesized markers:

^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))
  • ^ Start of the string (or of each line when re.MULTILINE is used)
  • [\t ]* Match 0+ times a space or tab
  • (?: Non capturing group
    • \d+(?:\.\d+)+ Match 1+ digits, then repeat 1+ times a dot and 1+ digits, so at least one dot is required (matches 2.3.3 or 2.4)
    • |
    • \([a-z]+\) Match 1+ times a-z between parentheses
  • ) Close non capturing group


For example, using re.MULTILINE, where s is your string:

import re
pattern = r"^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))"
result = re.sub(pattern, "", s, flags=re.MULTILINE)
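As a rough check, the anchored pattern can be run against a shortened excerpt of the sample text from the question. This is only a sketch: the excerpt is abbreviated, and the names sample and cleaned are chosen for this example.

```python
import re

# Abbreviated excerpt of the sample text from the question.
sample = """\
aforementioned deposit, and the same is forfeited to the Crown.
2.3.3

(b)

until thirty-five days after the time fixed for receiving this tender,

whichever first occurs.
2.4

AGREEMENT

complete the said work on or before October 15, 2019.
"""

# With re.MULTILINE, ^ matches at the start of every line, not just the string.
section_marker = re.compile(r"^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))", re.MULTILINE)

cleaned = section_marker.sub("", sample)
print(cleaned)
```

The markers are replaced by empty lines rather than removed together with their line breaks, so the cleaned text may still need its blank lines collapsed before tokenizing.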

Upvotes: 0

Tim Biegeleisen

Reputation: 522817

You may try replacing matches of the following pattern with an empty string:

((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))

import re
# text holds the document to clean (avoids shadowing the built-in input)
output = re.sub(r'((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))', '', text)
print(output)

This pattern works by matching a section number as \d+(?:\.\d+)*, but only if it appears at the start of a line. It also matches letter section headers as \([a-z]+\).
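Two properties of this pattern are worth checking in a small sketch. The lookbehind pair behaves like ^ under re.MULTILINE, and because of the * quantifier a bare integer at the start of a line is treated as a section number too. The line starting with "35 days" below is a hypothetical input, not part of the question's sample.

```python
import re

lookbehind = re.compile(r"((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))")
multiline = re.compile(r"^(?:\d+(?:\.\d+)*|\([a-z]+\))", re.MULTILINE)

# The third line is a made-up example of a sentence that starts with a number.
text = "2.3.3\n(b)\n35 days after the tender\ndone by October 15, 2019.\n"

out_lookbehind = lookbehind.sub("", text)
out_multiline = multiline.sub("", text)

print(repr(out_lookbehind))
print(out_lookbehind == out_multiline)
```

If lines in the real documents can begin with an ordinary number, requiring at least one dot (+ instead of *) would avoid stripping it, at the cost of missing top-level section numbers such as a lone 2.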

Upvotes: 2

phsa

Reputation: 61

For your specific case, I think \n[\d+\.]+|\n\(\w\) should work. The \n helps to differentiate the section markers from numbers inside a sentence.
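A short sketch of how this pattern behaves, using a shortened excerpt of the sample text. Note three details: the preceding newline is removed along with the marker, a marker on the very first line is not matched because no \n precedes it, and inside the brackets + is a literal plus sign rather than a quantifier. The last two inputs are made up for this example.

```python
import re

newline_pattern = r"\n[\d+\.]+|\n\(\w\)"

excerpt = "to the Crown.\n2.3.3\n\n(b)\n\nuntil thirty-five days\n"
cleaned = re.sub(newline_pattern, "", excerpt)

# No newline comes before the marker here, so nothing is removed.
first_line = re.sub(newline_pattern, "", "2.4\nAGREEMENT")

# The character class [\d+\.] accepts digits, dots and a literal "+".
plus_sign = re.sub(newline_pattern, "", "a\n+1\nb")

print(repr(cleaned))
print(repr(first_line))
print(repr(plus_sign))
```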

Upvotes: 0
