Python regex to replace single newlines and ignore sequences of two or more newlines

Question

I'm using Python 3.6 through 3.8.

I'm trying to replace any instance of a single newline with a single space in text read from a file. My goal is to compress paragraphs into single lines of text for re-wrapping by textwrap. Since textwrap only works on a single paragraph, I need an easy way to detect and delineate paragraphs, and compressing each one into a single line of text seems the most expedient. For this to work, any run of two or more newlines in sequence defines a paragraph boundary and should be left alone.

My first try was with lookahead/lookbehind assertions to insist that any newline I replace not be bounded by other newlines:

re.sub(r'(?
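As a minimal sketch of that lookaround idea, assuming the pattern is `(?<!\n)\n(?!\n)` (an assumed reconstruction; the names `SINGLE_NEWLINE` and `join_lines` are hypothetical), a newline is replaced only when no newline sits directly before or after it:

```python
import re

# Assumed pattern: a newline not directly preceded or followed by another newline.
SINGLE_NEWLINE = re.compile(r'(?<!\n)\n(?!\n)')

def join_lines(text):
    """Replace each isolated newline with a single space."""
    return SINGLE_NEWLINE.sub(' ', text)

print(repr(join_lines("one\ntwo\n\nthree")))  # 'one two\n\nthree'
```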



This works fine in most circumstances. However, I quickly ran into a case where someone had a paragraph separator that contained other whitespace.


  This is some sample text beginning with a short paragraph.

This second paragraph is long enough to be split across lines, so it contains
a single newline in the middle.
 
This third paragraph has an unusual separator before it; a newline followed by
a space followed by another newline.  It's a special case that needs to be
handled.


My lookahead/lookbehind assertion tactic won't work here, because the required lookbehind needs to be of an indeterminate length (maybe the space is there, maybe it isn't) and that's not allowed.

# this is an error
re.sub(r'(?
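To illustrate the restriction, here is a small sketch. The helper `compiles` is hypothetical, and the second pattern is only an assumed example of a variable-width lookbehind. Python's `re` module rejects such a pattern at compile time with `re.error`:

```python
import re

def compiles(pattern):
    """Return True if `re` accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r'(?<!\n)\n(?!\n)'))          # True: fixed-width lookbehind
print(compiles(r'(?<!\n[^\S\n]*)\n(?!\n)'))  # False: lookbehind width varies
```

Lookaheads have no such limit, which is why only the lookbehind half of the idea breaks down.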


My next try was to do this in two passes, first removing any non-newline whitespace between newlines, but I can't find a regex that does that perfectly. This works, sort of, but it also compresses any run of more than two newlines.

# this compresses "\n\n\n" or "\n\n \n" into "\n\n"
re.sub(r'(?


I'd like to avoid this, because extra blank lines between paragraphs may be intentional; they should be left alone.
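The compression can be reproduced with a sketch like the following, assuming a first pass of the form `(?<=\n)\s+(?=\n)` (an assumption; the helper name `strip_between_newlines` is hypothetical). Because `\s` also matches `\n`, the pass consumes surplus newlines as well as stray spaces:

```python
import re

def strip_between_newlines(text):
    # Assumed first pass: delete any whitespace run sitting between two newlines.
    # \s includes "\n", so extra newlines are deleted along with spaces.
    return re.sub(r'(?<=\n)\s+(?=\n)', '', text)

print(repr(strip_between_newlines("a\n \nb")))   # 'a\n\nb'  (intended effect)
print(repr(strip_between_newlines("a\n\n\nb")))  # 'a\n\nb'  (a blank line is lost)
```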

The Unicode definition of \s isn't specific enough to allow me to construct a character set of "all whitespace except newlines", so I can't do something like this:

# this only works for ASCII
re.sub(r'(?


To do that I need a way to express "\s except \n" for Unicode and I don't think that exists. I tried [\s!\n] on a lark and, bizarrely, it seems to do the right thing in 3.6.5 and 3.8.0. This, despite the fact that ! has no documented effect inside a character set for either version, and that the documentation for re.escape() explicitly states that, as of 3.7, ! is no longer escaped by the method as it's not a special character.

# this appears to work, but the docs say it shouldn't
re.sub(r'(?


Even though it seems to work, I don't want to rely on the behaviour, for obvious reasons.  I should probably report it as a bug in either the code or the documentation.
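A quick check suggests an explanation that does not require `!` to be special: inside a character set `!` is an ordinary literal, so `[\s!\n]` is simply the union of `\s`, `!`, and `\n`. The sketch below, with sample characters chosen purely for illustration, shows that the set still matches newlines (and exclamation marks), so it behaves like `\s` plus `!` rather than "`\s` except `\n`":

```python
import re

charset = re.compile(r'[\s!\n]')

# Print whether each sample character is a member of the set.
for ch in [' ', '\t', '\n', '!', 'a']:
    print(repr(ch), bool(charset.fullmatch(ch)))
# ' ' True, '\t' True, '\n' True, '!' True, 'a' False
```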

Assuming that last one is not supposed to be supported, what other approach am I missing?

Wiktor Stribiżew · Accepted Answer

You can capture runs of two or more newlines so that they are kept when matched, and simply match every other newline:

import re

text = (
    "This is some sample text beginning with a short paragraph.\n"
    "\n"
    "This second paragraph is long enough to be split across lines, so it contains\n"
    "a single newline in the middle.\n"
    " \n"
    "This third paragraph has an unusual separator before it; a newline followed by\n"
    "a space followed by another newline. It's a special case that needs to be\n"
    "handled."
)

# Group 1 captures runs of two or more newlines (with optional non-newline
# whitespace) so they can be kept; the second alternative matches a lone newline.
pattern = r'([^\S\n]*\n(?:[^\S\n]*\n)+[^\S\n]*)|[^\S\n]*\n[^\S\n]*'
print(re.sub(pattern, lambda x: x.group(1) or ' ', text))


Details

([^\S\n]*\n(?:[^\S\n]*\n)+[^\S\n]*) - Group 1: 0+ whitespaces other than a newline, a newline, then 1 or more occurrences of 0+ whitespaces other than a newline followed by a newline (so at least two newlines are matched), and then again 0+ whitespaces other than a newline
| - or
[^\S\n]*\n[^\S\n]* - 0+ whitespaces other than a newline, a newline, and again 0+ whitespaces other than a newline
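The building block `[^\S\n]` is a double negative: "not (a non-whitespace character or a newline)", which leaves every whitespace character except `\n`. A small sketch, with sample characters chosen for illustration and a hypothetical name `ws_no_newline`, suggests it also covers Unicode whitespace when the pattern is a `str`:

```python
import re

ws_no_newline = re.compile(r'[^\S\n]')

samples = {
    'space': ' ',
    'tab': '\t',
    'no-break space': '\u00a0',
    'em space': '\u2003',
    'newline': '\n',
    'letter': 'a',
}
# Only the last two entries (newline and letter) fall outside the set.
for name, ch in samples.items():
    print(name, bool(ws_no_newline.fullmatch(ch)))
```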

The replacement is lambda x: x.group(1) or ' ': if Group 1 matched, the matched text is put back unchanged; otherwise, the match is replaced with a single space.
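Tying this back to the re-wrapping goal in the question, the pattern could be packaged roughly as follows. This is a sketch under assumptions: the names `NEWLINES`, `unwrap`, and `rewrap` are hypothetical, and the default `width` is an arbitrary example value.

```python
import re
import textwrap

# Group 1: a run of two or more newlines (a paragraph break), kept verbatim.
# Otherwise: a lone newline plus surrounding non-newline whitespace, replaced by a space.
NEWLINES = re.compile(r'([^\S\n]*\n(?:[^\S\n]*\n)+[^\S\n]*)|[^\S\n]*\n[^\S\n]*')

def unwrap(text):
    """Join the lines of each paragraph while leaving paragraph breaks untouched."""
    return NEWLINES.sub(lambda m: m.group(1) or ' ', text)

def rewrap(text, width=40):
    """Unwrap, then re-wrap every non-blank line to `width` columns."""
    lines = unwrap(text).split('\n')
    return '\n'.join(textwrap.fill(line, width) if line.strip() else line
                     for line in lines)

sample = "First paragraph.\n\nSecond paragraph,\nsplit in two.\n \nThird one.\n\n\nFourth."
print(repr(unwrap(sample)))
print(rewrap(sample, width=20))
```

After `unwrap`, each paragraph sits on one line, so splitting on `\n` yields either a whole paragraph or a blank (or whitespace-only) separator line, and the separators pass through unchanged.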
