Reputation: 7500
Let's say I have a string like
s=""" Bob sent some money to Ana. It was 10.23 dollars. Ana thanked him.
"""
I want the output to be
Bob sent some money to Ana. It was dollars. Ana thanked him.
So basically, keep only letters and the periods that mark the end of a sentence. Remove non-alphabetic characters and also periods in between numbers.
I am trying to use
re.sub(r"[^A-Za-z.\n]"," ",s)
But this obviously keeps the period inside the number and gives
' Bob sent some money to Ana. It was . dollars. Ana thanked him. \n\n'
I want to remove the period inside the number too, because later I want to break the text into sentences by treating periods or \n as the end of a sentence. A period that was part of a decimal number would also split the sentence at that point, which is not ideal.
Upvotes: 0
Views: 566
Reputation: 48741
... and also periods in between numbers.
A period in between numbers means it precedes at least one digit. So you could match these decimal points with \.+(?=\d). The + quantifier is not required, but it also matches edge cases like 1.......2:
re.sub(r"\.+(?=\d)|[^a-z\s.]", "", s, flags=re.IGNORECASE)
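As a quick check, here is a minimal runnable sketch of that substitution applied to the sample string from the question. The names pattern and cleaned are only illustrative. Note that the space on either side of the removed number survives, so two spaces are left between "was" and "dollars".

```python
import re

s = """ Bob sent some money to Ana. It was 10.23 dollars. Ana thanked him.
"""

# Remove periods that are directly followed by a digit, plus every
# character that is not a letter, whitespace, or a period.
pattern = re.compile(r"\.+(?=\d)|[^a-z\s.]", re.IGNORECASE)
cleaned = pattern.sub("", s)

# repr() makes the leftover whitespace visible.
print(repr(cleaned))
```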
You may also want to remove the extra whitespace that precedes the removed characters. If so, account for it in your regex:
\s*(?:\d+\.+(?=\d)|[^a-z\s.])
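A sketch of how this second pattern might be used, followed by the sentence split the question mentions as the end goal. The split on periods and newlines is an assumption about what the asker intends, not something taken from their code, and the names cleaned and sentences are illustrative.

```python
import re

s = """ Bob sent some money to Ana. It was 10.23 dollars. Ana thanked him.
"""

# The leading \s* also consumes whitespace directly before a removed
# number or symbol, so no double space is left behind.
pattern = re.compile(r"\s*(?:\d+\.+(?=\d)|[^a-z\s.])", re.IGNORECASE)
cleaned = pattern.sub("", s)

# Assumed follow-up step: treat every remaining period or newline as a
# sentence boundary and drop the empty pieces.
sentences = [part.strip() for part in re.split(r"[.\n]", cleaned) if part.strip()]

print(repr(cleaned))
print(sentences)
```

The space at the very start of the string is kept, because \s* only matches when it is followed by something the pattern removes; call cleaned.strip() if that matters.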
Upvotes: 1