Reputation: 1194
I'm using Python 3.6 through 3.8.
I'm trying to replace any instance of a single newline with a single space in text read from a file. My goal is to compress paragraphs into single lines of text for re-wrapping by textwrap. Since textwrap only works on a single paragraph, I need an easy way to detect/delineate paragraphs, and compressing them into a single line of text seems the most expedient. For this to work, any sequence of two or more newlines defines a paragraph boundary and should be left alone.
My first try was with lookahead/lookbehind assertions to insist that any newline I replace not be bounded by other newlines:
re.sub(r'(?<!\n)\n(?!\n)', ' ', input_text)
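As a minimal sketch of what this first attempt does in the simple case (the sample string is made up for illustration and is not from the file being processed):

```python
import re

# Hypothetical sample: one paragraph split over two lines, then a blank
# line, then a second paragraph.
sample = "first line\nsecond line\n\nnew paragraph"

# Replace a newline only when no newline sits directly before or after it.
joined = re.sub(r'(?<!\n)\n(?!\n)', ' ', sample)
print(repr(joined))
```

The lone newline becomes a space, while the "\n\n" boundary is untouched because each of its newlines has a newline as a neighbour.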
This works fine in most circumstances. However, I quickly ran into a case where someone had a paragraph separator that contained other whitespace.
This is some sample text beginning with a short paragraph.\n\nThis second paragraph is long enough to be split across lines, so it contains\na single newline in the middle.\n \nThis third paragraph has an unusual separator before it; a newline followed by\na space followed by another newline. It's a special case that needs to be\nhandled.
My lookahead/lookbehind assertion tactic won't work here, because the required lookbehind needs to be of an indeterminate length (maybe the space is there, maybe it isn't) and that's not allowed.
# this is an error
re.sub(r'(?<!\n\s*)\n(?!\s*\n)', ' ', input_text)
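The failure can be reproduced without any input text, since the pattern is rejected when it is compiled. This sketch assumes the standard-library re module; the variable name compiled is only there for illustration.

```python
import re

# The lookbehind (?<!\n\s*) has no fixed width, so the stdlib re module
# refuses to compile the pattern at all.
try:
    re.compile(r'(?<!\n\s*)\n(?!\s*\n)')
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)
```

Only the lookbehind is the problem here; the variable-width lookahead (?!\s*\n) on its own is accepted.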
My next try was to do this in two passes, first removing any non-newline whitespace between newlines, but I can't find a regex that will do that perfectly. This works, sort of, but it will compress any run of more than two newlines.
# this compresses "\n\n\n" or "\n\n \n" into "\n\n"
re.sub(r'(?<!\n)\n(?!\n)', ' ', re.sub(r'\n\s*\n', '\n\n', input_text))
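To make the side effect concrete, here is the same two-pass idea wrapped in a helper (the name two_pass and the input strings are hypothetical):

```python
import re

def two_pass(text):
    # Pass 1: collapse any whitespace run bracketed by newlines to "\n\n".
    # Because \s also matches "\n", this swallows extra blank lines too.
    normalized = re.sub(r'\n\s*\n', '\n\n', text)
    # Pass 2: turn the remaining lone newlines into spaces.
    return re.sub(r'(?<!\n)\n(?!\n)', ' ', normalized)

# Two blank lines between "two" and "three" in the input.
print(repr(two_pass("one\ntwo\n\n\nthree")))
```

The "\n \n" separator is normalized as intended, but the three consecutive newlines in the example come back as only two.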
I'd like to avoid this, because extra blank lines between paragraphs may be intentional; they should be left alone.
The Unicode definition of \s isn't specific enough to allow me to construct a character set of "all whitespace except newlines", so I can't do something like this:
# this only works for ASCII
re.sub(r'(?<!\n)\n(?!\n)', ' ', re.sub(r'\n[ \t\r\f\v]*\n', '\n\n', input_text))
To do that I need a way to express "\s except \n" for Unicode and I don't think that exists. I tried [\s!\n] on a lark and, bizarrely, it seems to do the right thing in 3.6.5 and 3.8.0. This, despite the fact that ! has no documented effect inside a character set for either version, and that the documentation for re.escape() explicitly states that, as of 3.7, ! is no longer escaped by the method as it's not a special character.
# this appears to work, but the docs say it shouldn't
re.sub(r'(?<!\n)\n(?!\n)', ' ', re.sub(r'\n[\s!\n]\n', '\n\n', input_text))
Even though it seems to work, I don't want to rely on the behaviour, for obvious reasons. I should probably report it as a bug in either the code or the documentation.
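One way to probe what that character class really matches is to test it against single characters. This is only a sketch using the standard re module; the name cls is made up for the example.

```python
import re

cls = re.compile(r'[\s!\n]')

# If "!" negated what follows, the newline would be rejected and a
# literal "!" would not be matched.
print(bool(cls.fullmatch('!')))   # literal exclamation mark
print(bool(cls.fullmatch('\n')))  # newline
print(bool(cls.fullmatch(' ')))   # space
print(bool(cls.fullmatch('x')))   # ordinary letter
```

In this check the exclamation mark, the newline and the space are all accepted and only the letter is rejected, which is what the documentation describes: "!" is an ordinary member of the set, not an "except" operator. That would suggest the pattern only appears to work on the sample because \s already covers the single space between the two newlines.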
Assuming that last one is not supposed to be supported, what other approach am I missing?
Upvotes: 2
Views: 844
Reputation: 626709
You can capture runs of two or more newlines so that they are kept as matched, and simply match every other newline:
import re
text = "This is some sample text beginning with a short paragraph.\n\nThis second paragraph is long enough to be split across lines, so it contains\na single newline in the middle.\n \nThis third paragraph has an unusual separator before it; a newline followed by\na space followed by another newline. It's a special case that needs to be\nhandled."
print(re.sub(r'([^\S\n]*\n(?:[^\S\n]*\n)+[^\S\n]*)|[^\S\n]*\n[^\S\n]*', lambda x: x.group(1) or ' ', text))
See the Python demo
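Details
- ([^\S\n]*\n(?:[^\S\n]*\n)+[^\S\n]*) - Group 1: zero or more whitespace characters other than a newline, a newline, then one or more occurrences of zero or more non-newline whitespace characters followed by a newline (so at least two newlines are matched), and then again zero or more whitespace characters other than a newline
- | - or
- [^\S\n]*\n[^\S\n]* - zero or more whitespace characters other than a newline, a newline, and again zero or more whitespace characters other than a newline

Here [^\S\n] is a negated class that excludes both non-whitespace characters and the newline, so it matches any whitespace character except a newline. It follows whatever definition of whitespace \S uses for the pattern, which is the Unicode one for str patterns.

The replacement is lambda x: x.group(1) or ' ': if Group 1 matched, the match is put back unchanged; otherwise, it is replaced with a space.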
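For the re-wrapping goal described in the question, the pattern could be wrapped up roughly as follows. This is a sketch under assumptions: the helper names unwrap_paragraphs and rewrap are hypothetical, and treating every whitespace-only line as part of a separator is a design choice that is not taken from the answer.

```python
import re
import textwrap

# Same pattern as in the answer, compiled once.
# [^\S\n] matches any whitespace character except a newline.
_NEWLINES = re.compile(r'([^\S\n]*\n(?:[^\S\n]*\n)+[^\S\n]*)|[^\S\n]*\n[^\S\n]*')

def unwrap_paragraphs(text):
    """Join the lines of each paragraph; keep runs of 2+ newlines as they are."""
    return _NEWLINES.sub(lambda m: m.group(1) or ' ', text)

def rewrap(text, width=70):
    """Hypothetical helper: re-wrap every unwrapped paragraph with textwrap.fill."""
    lines = unwrap_paragraphs(text).split('\n')
    # Whitespace-only lines belong to a separator and are passed through.
    return '\n'.join(
        textwrap.fill(line, width) if line.strip() else line
        for line in lines
    )

# Separator containing an EM SPACE (U+2003), which is outside ASCII.
print(repr(unwrap_paragraphs("one\ntwo\n\u2003\nthree")))
print(rewrap("aaa bbb\nccc ddd\n\neee", width=12))
```

Because the separators are returned exactly as matched, extra blank lines and stray whitespace inside them survive the round trip; only the paragraph text itself is re-flowed.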
Upvotes: 2