halifaxi

Reputation: 1

Need a regex that adds a space after a period but accounts for abbreviations such as U.S. or D.C.

Here is what I have so far:

text = re.sub(r'(?<=\.)(?=[A-Z])', ' ', text)

This already leaves numbers alone and skips periods followed by a lowercase letter, but I need it to handle the edge case where initials are separated by periods.

An example sentence where I wouldn't want to add a space would be:

The U.S. health care is more expensive than U.K health care.

Currently, my regex makes it like:

The U. S. health care is more expensive than U. K health care.

But I want it to look exactly like the first sentence without the spaces separating U.S and U.K

I'm not sure how to do this; any advice would be appreciated!

EDIT:

(?<=\.)(?=[A-Z][a-z]{1,}) 

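makes it skip abbreviations made of single-letter initials, because a space is only added before a capital letter that is followed by at least one lowercase letter.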
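For reference, here is a minimal sketch of how the EDIT pattern behaves when passed to re.sub. The helper name add_space_before_capitalized_word and the extra "The end." in the sample are only for illustration and are not part of my real code.

```python
import re

# Pattern from the EDIT: the position right after a period and right before
# a capitalized word (one capital followed by at least one lowercase letter).
EDIT_PATTERN = re.compile(r'(?<=\.)(?=[A-Z][a-z]{1,})')

def add_space_before_capitalized_word(text):
    # Hypothetical helper: insert one space at every matched position.
    return EDIT_PATTERN.sub(' ', text)

sample = "The U.S. health care is more expensive than U.K health care.The end."
print(add_space_before_capitalized_word(sample))
```

With this pattern U.S. and U.K stay intact, but a period followed by a one-letter word such as "A" or "I" would not get a space either.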

Upvotes: 0

Views: 311

Answers (1)

user18098820


I think that this does what you want. We match periods that are not preceded by a capital letter and not followed by whitespace, then put a space after each one.

import re
text="The U.S. health care is more expensive than U.K health care.The end."
text = re.sub(r'((?<![A-Z])\.(?!\s))',r'\1 ', text)
print('<',text,'>')

Output (with '<' and '>' to show the beginning and end of the text more clearly):

< The U.S. health care is more expensive than U.K health care. The end.  >
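Note that this pattern also matches a period at the very end of the text (hence the extra trailing space before '>') and would split decimals such as 3.14. A possible variant, assuming a space is only wanted when a capital letter follows the period directly, swaps the negative lookahead for a positive one. The function name space_after_sentence_period is just a placeholder, and this is a sketch rather than a complete solution.

```python
import re

# A period that is NOT preceded by a capital letter (so initials such as
# U.S. are skipped) and IS immediately followed by a capital letter.
PERIOD_BEFORE_CAPITAL = re.compile(r'(?<![A-Z])\.(?=[A-Z])')

def space_after_sentence_period(text):
    # Hypothetical helper: replace each matched period with a period plus a space.
    return PERIOD_BEFORE_CAPITAL.sub('. ', text)

text = "The U.S. health care is more expensive than U.K health care.The end. Pi is 3.14."
print('<' + space_after_sentence_period(text) + '>')
```

It still has a blind spot: a sentence that ends in a capital letter (for example "NASA.Then") is treated like an abbreviation and is left unchanged.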

Upvotes: 0
