John Fouhy

Reputation: 42193

Matching blank lines with regular expressions

I've got a string that I'm trying to split into chunks based on blank lines.

Given a string s, I thought I could do this:

re.split('(?m)^\s*$', s)

This works in some cases:

>>> s = 'foo\nbar\n \nbaz'
>>> re.split('(?m)^\s*$', s)
['foo\nbar\n', '\nbaz']

But it doesn't work if the line is completely empty:

>>> s = 'foo\nbar\n\nbaz'
>>> re.split('(?m)^\s*$', s)
['foo\nbar\n\nbaz']

What am I doing wrong?

[python 2.5; no difference if I compile '^\s*$' with re.MULTILINE and use the compiled expression instead]

Upvotes: 10

Views: 28080

Answers (5)

Leroy Scandal

Reputation: 11

Try this:

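with open('fu.txt') as f:
    # Read the whole file and split it into lines.
    lines = f.read().split('\n')
    for line in lines:
        # Compare by value (not with `is`); strip() also catches whitespace-only lines.
        if line.strip() == '': print('blank')
        else: print(line)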

Upvotes: 1

Sascha Gottfried

Reputation: 3329

The re library can split on one or more empty lines. An empty line consists of zero or more whitespace characters, starting at the beginning of a line and ending at the end of it. The special character '$' matches at the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline (excerpt from the docs). Because '$' matches before the newline without consuming it, we need a trailing '\s*' to consume the line break itself. Everything is possible :-)

>>> import re
>>> text = "foo\n   \n    \n    \nbar\n"
>>> re.split("(?m)^\s*$\s*", text)
['foo\n', 'bar\n']

The same regex works with Windows-style line breaks.

>>> import re
>>> text = "foo\r\n       \r\n     \r\n   \r\nbar\r\n"
>>> re.split("(?m)^\s*$\s*", text)
['foo\r\n', 'bar\r\n']

Upvotes: 3

Glenn Maynard

Reputation: 57514

Try this instead:

re.split(r'\n\s*\n', s)

The problem is that '^\s*$' only matches the whitespace (if any) that sits alone on a line, not the newlines themselves. When the line is completely empty, the match is zero characters long, which leaves the delimiter empty, and re.split (in Python before 3.7) does not split on empty matches.

This version also gets rid of the delimiting newlines themselves, which is probably what you want. Otherwise, you'll have the newlines stuck to the beginning and end of each split part.
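To see why the original pattern fails on a completely empty line, here is a minimal sketch that lists the spans the pattern matches. The helper name match_spans is made up for illustration; re.finditer and Match.span() are standard library calls.

```python
import re

# The pattern from the question, compiled with MULTILINE.
pattern = re.compile(r'^\s*$', re.MULTILINE)

def match_spans(s):
    """Return the (start, end) span of every match of the pattern in s."""
    return [m.span() for m in pattern.finditer(s)]

# Whitespace-only line: the match covers the single space, so it has length 1.
print(match_spans('foo\nbar\n \nbaz'))  # [(8, 9)]

# Completely empty line: the match is zero-width (start == end).
print(match_spans('foo\nbar\n\nbaz'))   # [(8, 8)]
```

A zero-width match gives re.split nothing to cut out. Python 3.7 and later do split on empty matches, but the pieces would still keep their surrounding newlines, so matching the newlines explicitly remains the simpler fix.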

Treating multiple consecutive blank lines as defining an empty block ("abc\n\n\ndef" -> ["abc", "", "def"]) is trickier...
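One possible way to get that behaviour is to drop the regex and group lines by hand. This is only a sketch under that assumption, not a drop-in for the re.split call above; split_blocks is a hypothetical helper name.

```python
def split_blocks(text):
    """Split text into blocks separated by blank (empty or whitespace-only) lines.

    Each blank line closes the current block, so two blank lines in a row
    produce an empty block between them.
    """
    blocks = []
    current = []
    for line in text.splitlines():
        if line.strip() == '':
            # A blank line ends the block collected so far.
            blocks.append('\n'.join(current))
            current = []
        else:
            current.append(line)
    # Whatever follows the last blank line is the final block.
    blocks.append('\n'.join(current))
    return blocks

print(split_blocks('abc\n\n\ndef'))       # ['abc', '', 'def']
print(split_blocks('foo\nbar\n \nbaz'))   # ['foo\nbar', 'baz']
```

Note that text starting or ending with a blank line yields an empty block at that end; filter those out if they are not wanted.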

Upvotes: 19

Sinan Ünür

Reputation: 118158

Is this what you want?

>>> import re
>>> s = 'foo\nbar\n\nbaz'
>>> re.split(r'\n\s*\n', s)
['foo\nbar', 'baz']

>>> s = 'foo\nbar\n \nbaz'
>>> re.split(r'\n\s*\n', s)
['foo\nbar', 'baz']

>>> s = 'foo\nbar\n\t\nbaz'
>>> re.split(r'\n\s*\n', s)
['foo\nbar', 'baz']

Upvotes: 0

Instance Hunter

Reputation: 7925

What you're doing wrong is using regular expressions. What is wrong with ('Some\ntext.').split('\n')?

Upvotes: -3
