Reputation: 1
Here is what I have so far:
text = re.sub(r'(?<=\.)(?=[A-Z])', ' ', text)
This already leaves numbers and lowercase letters alone, but I need it to handle the edge case where initials are separated by periods.
An example sentence where I wouldn't want to add a space would be:
The U.S. health care is more expensive than U.K health care.
Currently, my regex makes it like:
The U. S. health care is more expensive than U. K health care.
But I want it to look exactly like the first sentence, without the spaces splitting U.S and U.K
I'm not sure how to do this; any advice would be appreciated!
EDIT:
(?<=\.)(?=[A-Z][a-z]{1,})
makes the regex skip single-letter abbreviations, because it only matches before a capital letter that is followed by at least one lowercase letter.
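For reference, here is a minimal runnable sketch of the EDIT pattern applied to the example sentence. The helper name add_sentence_spaces and the extra ".The end." sentence are my own additions for illustration. It also shows one limitation I would expect: a sentence that starts with a one-letter word such as "I" or "A" is not separated, because the capital is not followed by a lowercase letter.

```python
import re

# Pattern from the EDIT: a zero-width position right after a period and
# right before a capital letter followed by at least one lowercase letter.
EDIT_PATTERN = re.compile(r'(?<=\.)(?=[A-Z][a-z]{1,})')

def add_sentence_spaces(text):
    """Hypothetical helper: insert a space at every position the pattern matches."""
    return EDIT_PATTERN.sub(' ', text)

sample = "The U.S. health care is more expensive than U.K health care.The end."
print(add_sentence_spaces(sample))

# Limitation: "I" and "A" are single capitals, so no space is inserted here.
print(add_sentence_spaces("It works.I checked.A test."))
```

The first call leaves U.S. and U.K alone and only splits "care.The"; the second call returns its input unchanged.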
Upvotes: 0
Views: 311
Reputation:
I think this does what you want. We match periods that have neither a capital letter before them nor whitespace after them, and add a space after each one.
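import re
text = "The U.S. health care is more expensive than U.K health care.The end."
# Capture each period not preceded by a capital and not followed by whitespace,
# then write it back with a space appended.
text = re.sub(r'((?<![A-Z])\.(?!\s))', r'\1 ', text)
print('<', text, '>')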
Output (with '<' and '>' to show the beginning and end of the text more clearly; note that the final period also gets a space appended):
< The U.S. health care is more expensive than U.K health care. The end.  >
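One caveat: the pattern above also matches the period inside a decimal such as 3.14 and the last period of the text, since neither is followed by whitespace. Below is a sketch of a stricter variant that combines the lookbehind from this answer with the "followed by a capital letter" condition from the question. The names space_answer and space_strict are made up for the comparison, and the stricter pattern is only one possible approach, not a complete sentence splitter.

```python
import re

# Pattern from the answer: period not preceded by a capital, not followed by whitespace.
ANSWER_PATTERN = re.compile(r'((?<![A-Z])\.(?!\s))')

# Assumed stricter variant: the period must also be directly followed by a capital,
# so decimals and the final period of the text are left untouched.
STRICT_PATTERN = re.compile(r'(?<![A-Z])\.(?=[A-Z])')

def space_answer(text):
    """Hypothetical wrapper around the answer's substitution."""
    return ANSWER_PATTERN.sub(r'\1 ', text)

def space_strict(text):
    """Hypothetical wrapper around the stricter substitution."""
    return STRICT_PATTERN.sub('. ', text)

sample = "The U.S. health care is more expensive than U.K health care.The end."
print(repr(space_answer(sample)))       # ends with a trailing space
print(repr(space_strict(sample)))       # no trailing space
print(repr(space_answer("Pi is 3.14.")))
print(repr(space_strict("Pi is 3.14.Next")))
```

The trade-off is that the stricter variant will not split a sentence that ends in an all-caps acronym (for example "NASA.The"), because the period is then preceded by a capital letter.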
Upvotes: 0