MikeiLL

Reputation: 6550

Find 2 or more Newlines

My string looks like:

'I saw a little hermit crab\r\nHis coloring was oh so drab\r\n\r\nIt\u2019s hard to see the butterfly\r\nBecause he flies across the sky\r\n\r\nHear the honking of the goose\r\nI think he\u2019s angry at the moose\r\n\r\n'

And I need to split it wherever there are two or more newlines.

I am using the re module, of course.

On this particular string re.split(r'\r\n\r\n+', text) works, but it wouldn't catch \r\n\r\n\r\n, right?

I have tried re.split(r'(\r\n){2,}', text), which splits at every line, and re.split(r'\r\n{2,}', text), which returns a list of length 1.

Shouldn't re.split(r'(\r\n){2,}', text) == re.split(r'\r\n\r\n', text) be True for a string in which there are no consecutive occurrences of more than 2 \r\n?

Upvotes: 0

Views: 134

Answers (2)

hwnd

Reputation: 70722

You want to use a non-capturing group instead of a capturing group when you call re.split(). The documentation states that using a capturing group retains the separator pattern:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

re.split(r'(?:\r\n){2,}', text)
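
As a minimal sketch of how that call behaves, here it is applied to a shortened stand-in string (the sample text and the variable names are my own assumptions, not the asker's data). It contains runs of two and three \r\n pairs and ends with a separator:

```python
import re

# Hypothetical sample: stanzas separated by two or three CRLF pairs,
# with a separator at the very end as well.
text = 'a\r\nb\r\n\r\nc\r\nd\r\n\r\n\r\ne\r\n\r\n'

# (?:...) groups \r\n for the quantifier without capturing it,
# so the separators are not returned in the result.
raw_parts = re.split(r'(?:\r\n){2,}', text)
print(raw_parts)  # ['a\r\nb', 'c\r\nd', 'e', '']

# A trailing separator leaves an empty final element; drop it if unwanted.
stanzas = [part for part in raw_parts if part]
print(stanzas)  # ['a\r\nb', 'c\r\nd', 'e']
```

Single \r\n line breaks inside a stanza are left alone, and a run of three \r\n is consumed as one separator.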

Upvotes: 2

Aran-Fey

Reputation: 43136

re.split(r'(\r\n){2,}', text) doesn't split at every line. It does exactly what you want, except that it preserves one occurrence of \r\n because you've enclosed it in a capturing group. Use a non-capturing group instead:

(?:\r\n){2,}

Here you can see what the difference is:

>>> re.split(r'(?:\r\n){2,}', 'foo\r\n\r\nbar')
['foo', 'bar']
>>> re.split(r'(\r\n){2,}', 'foo\r\n\r\nbar')
['foo', '\r\n', 'bar']
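
The other two patterns from the question fail for a different reason: without a group, a quantifier applies only to the single token right before it. A rough sketch with three consecutive \r\n pairs (the sample string and variable names are made up for illustration):

```python
import re

sample = 'foo\r\n\r\n\r\nbar'  # three consecutive CRLF pairs

# The + binds only to the final \n, so the third \r\n stays in the output.
plus_result = re.split(r'\r\n\r\n+', sample)
print(plus_result)  # ['foo', '\r\nbar']

# {2,} binds only to \n: this looks for \r followed by two or more \n,
# which never occurs in CRLF text, so nothing is split.
brace_result = re.split(r'\r\n{2,}', sample)
print(brace_result)  # ['foo\r\n\r\n\r\nbar']

# Grouping makes the quantifier apply to the whole \r\n pair.
grouped_result = re.split(r'(?:\r\n){2,}', sample)
print(grouped_result)  # ['foo', 'bar']
```

That would explain why \r\n{2,} gave back a list of length 1 on the original string.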

Upvotes: 2
