Reputation: 385
How to normalize text with regex and some conditions?
Suppose we have a string like this:
One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1
and I want to normalize it like this:
one t 933 two three 35.4 four 9,3 8.5 five m2x13 m4.3x2.1
T933 --> T 933
35.4 --> 35.4
9,3 --> 9.3 (if there is a comma between digits, replace it with a dot)
What I have managed to do so far is this:
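import re

def process(text, **kwargs):
    # Treat decimal commas as dots
    text = text.replace(',', '.')
    # Split around numbers; the capturing group keeps the numbers in the result
    text = re.split(r'(-?\d*\.?\d+)', text)
    text = ' '.join(text)
    # str.lower() returns a new string, so its result has to be used
    return text.lower()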
but there is no condition for numbers that start with the letter 'M', so they are split as well. Also, for some reason I get some unnecessary spaces after processing the string.
Are there any ideas on how to do this with regex, or with helper methods like replace, lower, join and so on?
Upvotes: 1
Views: 996
Reputation: 626870
I can suggest a solution like
re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower()
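To make the two nested calls easier to follow, the one-liner can be unrolled so that the intermediate strings are visible. This is only a sketch: the names SPLIT_RE, PUNCT_RE and normalize_steps, and the short sample string, are illustrative and not part of the original solution; the two patterns are copied unchanged from the one-liner.

```python
import re

# Illustrative names; the patterns are the ones from the one-liner above.
SPLIT_RE = re.compile(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])')
PUNCT_RE = re.compile(r'[.,](?!(?<=\d.)\d)')

def normalize_steps(text):
    """Return (after_inner, after_outer, final) so each step can be inspected."""
    spaced = SPLIT_RE.sub(' ', text)     # inner call: insert spaces at letter/digit boundaries
    stripped = PUNCT_RE.sub('', spaced)  # outer call: drop dots/commas not between digits
    return spaced, stripped, stripped.lower()

for step in normalize_steps('three35.4. T933, 9,3'):
    print(repr(step))
```

For this sample the three printed values are 'three 35.4. T 933, 9,3', 'three 35.4 T 933 9,3' and 'three 35.4 t 933 9,3'.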
The outer re.sub is meant to remove dots or commas when not between digits:
[.,] - a comma or dot
(?!(?<=\d.)\d) - a negative lookahead that fails the match if there is a digit immediately to the right that is immediately preceded by a digit + any one char (that char being the matched dot or comma itself), i.e. if the dot or comma sits between two digits
The inner re.sub replaces the following pattern with a space:
(?<=[^\W\d_])(?<![MmXx])(?=\d) - a location between a letter ([^\W\d_] matches any letter) and a digit (see (?=\d)), where the letter is not M or X (case insensitive, [MmXx] can be written as (?i:[mx]))
| - or
(?<=\d)(?=[^\W\d_]) - a location between a digit and a letter.
See the Python demo:
import re
text = 'One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1 aa88aa'
print( re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower() )
Output:
one t 933 two three 35.4 four 9,3 8.5 five m2 x13 m4.3 x2.1 aa 88 aa
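Note that the output above splits M2x13 into m2 x13 and keeps the comma in 9,3. If the target from the question is needed exactly (m2x13 kept intact, 9,3 turned into 9.3), one possible variant is sketched below. It rests on two assumptions that are not in the original solution: a comma between two digits is always a decimal separator, and a digit followed by x plus another digit (as in 2x13) should never be split. The names DECIMAL_COMMA_RE, PUNCT_RE, SPLIT_RE and normalize are hypothetical.

```python
import re

# Hypothetical helper names; only PUNCT_RE is taken unchanged from the one-liner.
DECIMAL_COMMA_RE = re.compile(r'(?<=\d),(?=\d)')
PUNCT_RE = re.compile(r'[.,](?!(?<=\d.)\d)')
SPLIT_RE = re.compile(
    r'(?<=[^\W\d_])(?<![MmXx])(?=\d)'   # letter (not M/X) followed by a digit
    r'|(?<=\d)(?![Xx]\d)(?=[^\W\d_])'   # digit followed by a letter, unless it is x + digit
)

def normalize(text):
    text = DECIMAL_COMMA_RE.sub('.', text)  # assumption: 9,3 means 9.3
    text = PUNCT_RE.sub('', text)           # drop dots/commas not between digits
    text = SPLIT_RE.sub(' ', text)          # T933 -> T 933, M2x13 stays intact
    return text.lower()

print(normalize('One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1'))
```

This prints one t 933 two three 35.4 four 9.3 8.5 five m2x13 m4.3x2.1. As with the original pattern, any word ending in m or x directly before a digit (for example box12) is left unsplit, so the exclusion may need tightening for other inputs.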
Upvotes: 2