mpounsett

Reputation: 1194

Python regex to replace single newlines and ignore sequences of two or more newlines

I'm using Python 3.6 through 3.8.

I'm trying to replace every single newline with a single space in text read from a file. My goal is to collapse each paragraph into one line of text for re-wrapping by textwrap. Since textwrap only works on one paragraph at a time, I need an easy way to detect/delineate paragraphs, and collapsing each one into a single line seems the most expedient. For this to work, any run of two or more consecutive newlines defines a paragraph boundary and should be left alone.

My first try was with lookahead/lookbehind assertions to insist that any newline I replace not be bounded by other newlines:

re.sub(r'(?<!\n)\n(?!\n)', ' ', input_text)

This works fine in most circumstances. However, I quickly ran into a case where someone had a paragraph separator that contained other whitespace.

This is some sample text beginning with a short paragraph.\n\nThis second paragraph is long enough to be split across lines, so it contains\na single newline in the middle.\n \nThis third paragraph has an unusual separator before it; a newline followed by\na space followed by another newline. It's a special case that needs to be\nhandled.
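To make the failure concrete, here is a minimal sketch that applies the first pattern to a shortened stand-in for the sample above (the `sample` string is made up for illustration, not the full text). The `\n \n` separator is not recognised as a paragraph boundary, so both of its newlines are turned into spaces:

```python
import re

# Shortened stand-in for the sample text above (hypothetical, for illustration).
sample = "First paragraph.\n\nSecond paragraph, split\nacross lines.\n \nThird paragraph."

# First attempt: replace a newline only if no newline is directly adjacent to it.
result = re.sub(r'(?<!\n)\n(?!\n)', ' ', sample)

# The "\n\n" boundary survives, but "\n \n" is flattened into three spaces,
# merging the second and third paragraphs.
print(repr(result))
```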

My lookahead/lookbehind assertion tactic won't work here, because the required lookbehind needs to be of an indeterminate length (maybe the space is there, maybe it isn't) and that's not allowed.

# this is an error
re.sub(r'(?<!\n\s*)\n(?!\s*\n)', ' ', input_text)
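A small sketch of what "this is an error" means in practice, assuming the standard-library `re` module as used in the question: the pattern is rejected at compile time because the lookbehind contains `\s*`, which has no fixed width. The variable name `message` is only for illustration.

```python
import re

try:
    # The lookbehind (?<!\n\s*) can match a varying number of characters,
    # which the standard re module does not accept.
    re.compile(r'(?<!\n\s*)\n(?!\s*\n)')
except re.error as exc:
    message = str(exc)
    print("pattern rejected:", message)
else:
    message = None
    print("pattern compiled")
```

The variable-width lookahead `(?!\s*\n)` on its own is fine; only the lookbehind has the fixed-width restriction.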

My next try was to do this in two passes, first removing any non-newline whitespace between newlines, but I can't find a regex that will do that perfectly. This works, sort of, but will compress any run of more than two newlines.

# this compresses "\n\n\n" or "\n\n \n" into "\n\n"
re.sub(r'(?<!\n)\n(?!\n)', ' ', re.sub(r'\n\s*\n', '\n\n', input_text))

I'd like to avoid this, because extra blank lines between paragraphs may be intentional; they should be left alone.
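The side effect can be seen on a few tiny inputs. This is only a sketch; `two_pass` is a hypothetical helper name wrapping the two `re.sub` calls shown above:

```python
import re

def two_pass(text):
    # Pass 1: normalise any newline / whitespace / newline run to exactly "\n\n".
    normalized = re.sub(r'\n\s*\n', '\n\n', text)
    # Pass 2: replace the remaining isolated newlines with a space.
    return re.sub(r'(?<!\n)\n(?!\n)', ' ', normalized)

print(repr(two_pass("a\nb")))       # single newline is joined
print(repr(two_pass("a\n \nb")))    # odd separator is normalised
print(repr(two_pass("a\n\n\nb")))   # the extra blank line is lost
```

Because `\s` also matches `\n`, the first pass swallows the middle newline of `"\n\n\n"`, which is where the intentional extra blank line disappears.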

The Unicode definition of \s isn't specific enough to allow me to construct a character set of "all whitespace except newlines", so I can't do something like this:

# this only works for ASCII
re.sub(r'(?<!\n)\n(?!\n)', ' ', re.sub(r'\n[ \t\r\f\v]*\n', '\n\n', input_text))

To do that I need a way to express "\s except \n" for Unicode, and I don't think that exists. I tried [\s!\n] on a lark and, bizarrely, it seems to do the right thing in 3.6.5 and 3.8.0. This is despite the fact that ! has no documented effect inside a character set in either version, and that the documentation for re.escape() explicitly states that, as of 3.7, ! is no longer escaped by the method because it's not a special character.

# this appears to work, but the docs say it shouldn't
re.sub(r'(?<!\n)\n(?!\n)', ' ', re.sub(r'\n[\s!\n]\n', '\n\n', input_text))

Even though it seems to work, I don't want to rely on the behaviour, for obvious reasons. I should probably report it as a bug in either the code or the documentation.

Assuming that last one is not supposed to be supported, what other approach am I missing?
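A quick check suggests why `[\s!\n]` looks like it works: inside a character set `!` is an ordinary character, so the set is simply "any whitespace (newline included) or a literal `!`", and without a `*` it matches exactly one such character. The sketch below uses a hypothetical `first_pass` helper for the inner substitution:

```python
import re

char_set = re.compile(r'[\s!\n]')
print(bool(char_set.fullmatch('!')))    # literal exclamation mark is in the set
print(bool(char_set.fullmatch('\n')))   # newline is still matched, via \s
print(bool(char_set.fullmatch(' ')))    # ordinary whitespace is matched too

def first_pass(text):
    # Inner substitution from the example above.
    return re.sub(r'\n[\s!\n]\n', '\n\n', text)

print(repr(first_pass("a\n \nb")))     # one space between newlines: normalised
print(repr(first_pass("a\n  \nb")))    # two spaces: not matched at all
print(repr(first_pass("a\n\n\nb")))    # three newlines: still compressed
```

So the sample text passes only because its separator contains exactly one space; it does not express "whitespace except newline".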

Upvotes: 2

Views: 844

Answers (1)

Wiktor Stribiżew

Reputation: 626709

You may capture every run of two or more newlines (together with any other whitespace around them) in a group so that it can be put back unchanged, and simply match all other newlines without capturing them:

import re
text = "This is some sample text beginning with a short paragraph.\n\nThis second paragraph is long enough to be split across lines, so it contains\na single newline in the middle.\n \nThis third paragraph has an unusual separator before it; a newline followed by\na space followed by another newline. It's a special case that needs to be\nhandled."
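# Group 1 holds a paragraph separator, which is put back unchanged; any other newline becomes a space.
print(re.sub(r'([^\S\n]*\n(?:[^\S\n]*\n)+[^\S\n]*)|[^\S\n]*\n[^\S\n]*', lambda x: x.group(1) or ' ', text))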

Details

  • ([^\S\n]*\n(?:[^\S\n]*\n)+[^\S\n]*) - Group 1: zero or more whitespace characters other than a newline, then a newline, then one or more repetitions of "zero or more non-newline whitespace characters followed by a newline" (so at least two newlines are matched in total), and finally zero or more non-newline whitespace characters
  • | - or
  • [^\S\n]*\n[^\S\n]* - zero or more whitespace characters other than a newline, a single newline, and again zero or more whitespace characters other than a newline
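The building block `[^\S\n]` is the "whitespace except newline" set the question was looking for: it excludes everything that is not whitespace (`\S`) and also excludes `\n`. A small sketch to check this on a few characters, including non-ASCII ones (the `candidates` table is just an illustrative selection):

```python
import re

non_newline_ws = re.compile(r'[^\S\n]')

# Illustrative selection of characters; not an exhaustive list of whitespace.
candidates = {
    'space': ' ',
    'tab': '\t',
    'carriage return': '\r',
    'no-break space': '\u00a0',
    'em space': '\u2003',
    'newline': '\n',
    'letter': 'x',
}

matches = {name: bool(non_newline_ws.fullmatch(ch)) for name, ch in candidates.items()}
for name, matched in matches.items():
    print(f"{name}: {matched}")
```

With a str pattern, `\s` follows the Unicode definition of whitespace, so the double negative carries over to non-ASCII spaces without listing them by hand.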

The replacement is lambda x: x.group(1) or ' ': if Group 1 matched, the matched text is put back unchanged; otherwise x.group(1) is None and the match is replaced with a single space.
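To connect this back to the re-wrapping goal in the question, here is one possible sketch that joins the lines of each paragraph and then passes each paragraph to `textwrap.fill`. The names `join_paragraph_lines` and `rewrap`, the default width, and the short sample string are assumptions made for illustration, not part of the original answer:

```python
import re
import textwrap

# Same pattern as above: group 1 captures paragraph separators.
PATTERN = re.compile(r'([^\S\n]*\n(?:[^\S\n]*\n)+[^\S\n]*)|[^\S\n]*\n[^\S\n]*')

def join_paragraph_lines(text):
    # Keep separators (group 1) as they are; turn any other newline into a space.
    return PATTERN.sub(lambda m: m.group(1) or ' ', text)

def rewrap(text, width=70):
    joined = join_paragraph_lines(text)
    # After joining, every paragraph sits on one line; blank or
    # whitespace-only lines belong to a separator and are left untouched.
    lines = joined.split('\n')
    return '\n'.join(
        textwrap.fill(line, width) if line.strip() else line
        for line in lines
    )

sample = ("A short paragraph.\n\nA second paragraph that was split\n"
          "across two lines.\n \nA third paragraph after an odd separator.")
print(rewrap(sample, width=40))
```

Splitting on `'\n'` after the join keeps the original separators, including extra blank lines and the `\n \n` case, exactly as they appeared in the input.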

Upvotes: 2
