Reputation: 274
I have the following string:
mystring = "Foo some \n information \n Bar some \n more \n information \n Baz more \n information"
I would like to keep "\n" only when it precedes a word that starts with a capital letter. I would like to remove all other instances of "\n" in my sentence.
Desired output:
"Foo some information \n Bar some more information \n Baz more information"
Is there a way to do this with re.sub? I could split the string into words and check word[0].isupper(), but I believe there may be a way to identify capitalized words with a regex.
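For comparison, the split-based idea mentioned above might look roughly like the following. This is only a sketch: the function name is made up for illustration, and it assumes that a single space should replace each removed "\n" and that " \n " should separate the kept pieces, as in the desired output.

```python
def keep_newlines_before_capitals(text):
    """Keep a newline only when the text after it starts with a capital letter."""
    # Split on newlines and trim the spaces around each piece.
    pieces = [piece.strip() for piece in text.split("\n")]
    result = pieces[0]
    for piece in pieces[1:]:
        if piece and piece[0].isupper():
            # The piece starts with a capital letter: keep the newline.
            result += " \n " + piece
        elif piece:
            # Otherwise join the piece to the previous text with one space.
            result += " " + piece
    return result


mystring = "Foo some \n information \n Bar some \n more \n information \n Baz more \n information"
cleaned = keep_newlines_before_capitals(mystring)
print(repr(cleaned))
```

Note that str.isupper() also accepts non-ASCII capitals, which a regex class such as [A-Z] does not.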
Upvotes: 0
Views: 91
Reputation: 110725
If the text may span paragraphs (notwithstanding the reference to "sentence" in the question), you could use the regex
 *\n *(?!\n*[A-Z])
(with a space preceding the first *) and replace each match with a single space.
This performs the following operations:
 *               match 0+ spaces
\n               match a newline char
 *               match 0+ spaces
(?!\n*[A-Z])     negative lookahead: fail if what follows is
                 0+ newlines followed by an uppercase letter
As shown at the link, the text
Now is the time for all good regexers
to social distance themselves.
Here's to negative lookbehinds!

And also to positive lookbehinds!
becomes
Now is the time for all good regexers to social distance themselves.
Here's to negative lookbehinds!

And also to positive lookbehinds!
even though the newline character following "negative lookbehinds!" is not followed directly by an uppercase letter, but by another newline followed by an uppercase letter.
If the string ends with a newline, that newline is matched as well (and replaced with a space). That's because I'm using a negative lookahead rather than a positive one.
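A quick way to check this behaviour in Python is sketched below. The pattern is the one given above; the second pattern is a hypothetical variant that is not part of this answer. It is included because the string in the question has a space between each "\n" and the following word. There, the second " *" can backtrack to zero spaces, the lookahead then sees a space instead of an uppercase letter, and the newline before "Bar" is matched after all.

```python
import re

# Pattern from the answer: optional spaces, a newline, optional spaces, then a
# negative lookahead rejecting "0+ newlines followed by an uppercase letter".
ANSWER_PATTERN = re.compile(r" *\n *(?!\n*[A-Z])")

# Hypothetical variant (an assumption, not the answer's regex): the lookahead
# also skips spaces, so "\n Bar" is treated like "\nBar".
VARIANT_PATTERN = re.compile(r" *\n *(?![ \n]*[A-Z])")

paragraphs = (
    "Now is the time for all good regexers\n"
    "to social distance themselves.\n"
    "Here's to negative lookbehinds!\n"
    "\n"
    "And also to positive lookbehinds!"
)
joined = ANSWER_PATTERN.sub(" ", paragraphs)
print(joined)

# A trailing newline is matched too, so it becomes a single space.
trailing = ANSWER_PATTERN.sub(" ", "last line\n")
print(repr(trailing))

# The question's string has spaces after each newline.
mystring = "Foo some \n information \n Bar some \n more \n information \n Baz more \n information"
answer_result = ANSWER_PATTERN.sub(" ", mystring)    # no newline survives
variant_result = VARIANT_PATTERN.sub(" ", mystring)  # newlines before Bar/Baz kept
print(repr(answer_result))
print(repr(variant_result))
```

So the pattern as given suits text where the capital letter directly follows the newline(s); for input with spaces after the newline, something like the variant may be needed.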
Upvotes: 1
Reputation: 785551
You may use this negative lookahead regex:
>>> import re
>>> mystring = "Foo some \n information \n Bar some \n more \n information \n Baz more \n information"
>>> print(re.sub(r'\n(?! *[A-Z]) *', '', mystring))
Foo some information
 Bar some more information
 Baz more information
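To see which newlines this substitution removes, it can help to run it on a few small fragments. The fragments below are made up for illustration; only the pattern comes from the session above.

```python
import re

# Same pattern as in the session above.
pattern = re.compile(r'\n(?! *[A-Z]) *')

fragments = [
    "some \n information",  # lowercase word follows: newline and spaces removed
    "some \n Bar",          # capital after a space: newline kept
    "some \nBar",           # capital directly after: newline kept
    "end\n",                # nothing follows: newline removed
]
results = [pattern.sub('', fragment) for fragment in fragments]
for fragment, result in zip(fragments, results):
    print(repr(fragment), '->', repr(result))
```

Because [A-Z] covers only ASCII capitals, a word such as "Émile" would not count as capitalized here.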
RegEx Details:
\n : Match a line break
(?! *[A-Z]) : Negative lookahead to assert that we don't have an uppercase letter after optional spaces
 * : Match 0 or more spaces afterwards
Upvotes: 1