Reputation: 6550
My string looks like:
'I saw a little hermit crab\r\nHis coloring was oh so drab\r\n\r\nIt\u2019s hard to see the butterfly\r\nBecause he flies across the sky\r\n\r\nHear the honking of the goose\r\nI think he\u2019s angry at the moose\r\n\r\'
And I need to split it wherever there are two or more newlines.
I am using the re module, of course.
On this particular string, re.split(r'\r\n\r\n+', text) works, but it wouldn't catch \r\n\r\n\r\n, right?
I have tried re.split(r'(\r\n){2,}', text), which seems to split at every line, and re.split(r'\r\n{2,}', text), which returns a list of length 1.
Shouldn't re.split(r'(\r\n){2,}', text) == re.split(r'\r\n\r\n', text) be True for a string in which there are never more than two consecutive occurrences of \r\n?
Upvotes: 0
Views: 134
Reputation: 70722
You want to use a non-capturing group instead of a capturing group when you call re.split(). The documentation states that a capturing group causes the matched separator text to be kept in the result:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
re.split(r'(?:\r\n){2,}', text)
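As a quick check, here is a minimal sketch that applies this pattern to a shortened version of the text from the question. The sample string is my own assumption: I dropped the trailing line breaks and added a third \r\n after the second stanza to show that longer runs are also treated as a single separator.

```python
import re

# Assumed sample: adapted from the question's text, with one triple
# line break added between the second and third stanza.
text = (
    'I saw a little hermit crab\r\nHis coloring was oh so drab\r\n\r\n'
    'It\u2019s hard to see the butterfly\r\nBecause he flies across the sky\r\n\r\n\r\n'
    'Hear the honking of the goose\r\nI think he\u2019s angry at the moose'
)

# (?:...) groups \r\n so that {2,} applies to the whole pair,
# without adding the matched separator to the result list.
stanzas = re.split(r'(?:\r\n){2,}', text)

for stanza in stanzas:
    # Each stanza still contains its single internal \r\n.
    print(stanza.splitlines())
```

Each element of stanzas is one stanza with its internal line break intact, so splitlines() can be used afterwards if you need the individual lines.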
Upvotes: 2
Reputation: 43136
re.split(r'(\r\n){2,}', text) doesn't split at every line. It does exactly what you want, except that it preserves one occurrence of \r\n in the result because you've enclosed it in a capturing group. Use a non-capturing group instead:
(?:\r\n){2,}
Here you can see what the difference is:
>>> re.split(r'(?:\r\n){2,}', 'foo\r\n\r\nbar')
['foo', 'bar']
>>> re.split(r'(\r\n){2,}', 'foo\r\n\r\nbar')
['foo', '\r\n', 'bar']
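The other two patterns from the question behave differently because a quantifier binds only to the single item right before it, which in both cases is the last \n rather than the whole \r\n pair. A small sketch on an assumed sample string with three consecutive \r\n (the variable names are just for illustration):

```python
import re

# Assumed sample: three consecutive \r\n between the two words.
sample = 'foo\r\n\r\n\r\nbar'

# The + applies only to the final \n, so this matches \r\n\r followed by
# one or more \n. The third \r\n is not consumed and stays in the output.
plus_only = re.split(r'\r\n\r\n+', sample)

# The {2,} applies only to \n, so this needs \r followed by at least two \n.
# That sequence never occurs here, so nothing is split.
brace_only = re.split(r'\r\n{2,}', sample)

# Grouping makes {2,} apply to the whole \r\n pair.
grouped = re.split(r'(?:\r\n){2,}', sample)

print(plus_only)   # ['foo', '\r\nbar']
print(brace_only)  # ['foo\r\n\r\n\r\nbar']
print(grouped)     # ['foo', 'bar']
```

This is also why re.split(r'\r\n{2,}', text) returned a list of length 1 in the question.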
Upvotes: 2