Dmiich

Reputation: 385

How to normalize text with regex?

How can I normalize text with a regex that includes some conditions?

Suppose we have a string like this: One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1

And I want to normalize it like this: one t 933 two three 35.4 four 9,3 8.5 five m2x13 m4.3x2.1

  1. Remove all dots and commas (except as described in rule 4).
  2. Split letters from digits unless the token starts with the letter 'M': T933 --> T 933
  3. Make everything lowercase.
  4. Do not split if there is a dot or comma between digits: 35.4 --> 35.4 or 9,3 --> 9.3 (if there is a comma between digits, replace it with a dot).

What I have so far is this:

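import re

def process(text, **kwargs):
    text = text.replace(',', '.')
    # split on numbers; the capture group keeps the numbers in the result list
    parts = re.split(r'(-?\d*\.?\d+)', text)
    text = ' '.join(parts)
    text = text.lower()  # lower() returns a new string, so the result must be assigned
    return text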

But there is no condition for tokens that start with the letter 'M', so they are split as well. Also, for some reason, the processed string contains some unnecessary spaces.

Are there any ideas on how to do that with regex? Or with helper methods like replace, lower, join and so on?

Upvotes: 1

Views: 996

Answers (1)

Wiktor Stribiżew

Reputation: 626870

I can suggest a solution like

re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower()

The outer re.sub is meant to remove dots or commas when not between digits:

  • [.,] - a comma or dot
  • (?!(?<=\d.)\d) - a negative lookahead that fails the match if there is a digit immediately to the right that is itself immediately preceded by a digit + any one char (that is, when the dot or comma sits between two digits)
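To see the outer pattern in isolation, here is a minimal sketch that applies it on its own. The names PUNCT_NOT_BETWEEN_DIGITS and strip_punct are hypothetical, chosen only for this illustration; the pattern itself is the one from the answer.

```python
import re

# Same pattern as the outer re.sub above, compiled once for reuse.
# The constant and function names are hypothetical, for illustration only.
PUNCT_NOT_BETWEEN_DIGITS = re.compile(r'[.,](?!(?<=\d.)\d)')

def strip_punct(s):
    """Remove dots and commas unless they sit between two digits."""
    return PUNCT_NOT_BETWEEN_DIGITS.sub('', s)

print(strip_punct('two, three 35.4. four 9,3'))
# two three 35.4 four 9,3
```

The trailing dot in 35.4. is dropped because it is followed by a space, while the dot inside 35.4 and the comma inside 9,3 are kept.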

The inner re.sub replaces with a space the following pattern:

  • (?<=[^\W\d_])(?<![MmXx])(?=\d) - a location between a letter ([^\W\d_] matches any letter) and a digit (see (?=\d)), where the letter is not M or X (case insensitive, [MmXx] can be written as (?i:[mx]))
  • | - or
  • (?<=\d)(?=[^\W\d_]) - a location between a digit and a letter.
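The inner pattern can be tried on its own in the same way. This is only a sketch: BOUNDARY and split_letters_digits are hypothetical names, and the pattern is copied unchanged from the answer.

```python
import re

# Zero-width pattern from the inner re.sub above: it matches the position
# between a letter and a digit (unless the letter is M or X), or between
# a digit and a letter. Names here are hypothetical.
BOUNDARY = re.compile(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])')

def split_letters_digits(s):
    """Insert a space at every matched letter/digit boundary."""
    return BOUNDARY.sub(' ', s)

print(split_letters_digits('T933 three35.4 M2x13 aa88aa'))
# T 933 three 35.4 M2 x13 aa 88 aa
```

Note that M2x13 becomes M2 x13: the first alternative skips M and x, but the second alternative still matches between the digit 2 and the letter x.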

See the Python demo:

import re
text = 'One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1 aa88aa'
print( re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower() )

Output:

one t 933 two three 35.4 four 9,3 8.5 five m2 x13 m4.3 x2.1 aa 88 aa
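The output above contains m2 x13 and m4.3 x2.1, whereas the question asks for m2x13 and m4.3x2.1. One possible adjustment, offered as a sketch rather than as part of the original answer, is to add a (?![MmXx]) lookahead to the second alternative so that a digit followed by M or X is not split either. The function name normalize and the constant names are hypothetical.

```python
import re

# Variant of the answer's inner pattern (an assumption, not the original):
# the extra (?![MmXx]) stops a split between a digit and a following M/X.
LETTER_DIGIT = re.compile(
    r'(?<=[^\W\d_])(?<![MmXx])(?=\d)'   # letter other than M/X, then a digit
    r'|(?<=\d)(?![MmXx])(?=[^\W\d_])'   # digit, then a letter other than M/X
)
# Unchanged outer pattern: dot or comma that is not between two digits.
PUNCT = re.compile(r'[.,](?!(?<=\d.)\d)')

def normalize(text):
    """Space out letter/digit boundaries, drop stray punctuation, lowercase."""
    spaced = LETTER_DIGIT.sub(' ', text)
    return PUNCT.sub('', spaced).lower()

print(normalize('One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1'))
# one t 933 two three 35.4 four 9,3 8.5 five m2x13 m4.3x2.1
```

The trade-off is that any digit followed by m or x stays attached, so an input such as 5mm is left as 5mm rather than split into 5 mm.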

Upvotes: 2
