x-yuri

Reputation: 18963

Split by double newline, prioritizing crlf

The naive way to accomplish this would be:

import re
re.split(r'(?:\r\n|\r|\n){2}', '...')

But:

>>> re.split(r'(?:\r\n|\r|\n){2}', '\r\n\r\n\r\n')
['', '', '']

I'd like to get ['', '\r\n'] in this case. I probably need some sort of possessive matching, or a way to keep the regex from backtracking. Is there a way?

Upvotes: 1

Views: 198

Answers (1)

Wiktor Stribiżew

Reputation: 627110

You may restrict the \n and \r matching positions using lookarounds to avoid matching them when in a CRLF:

r'(?:\r\n|\r(?!\n)|(?<!\r)\n){2}'

Python test:

>>> import re
>>> re.split(r'(?:\r\n|\r(?!\n)|(?<!\r)\n){2}', '\r\n\r\n\r\n')
['', '\r\n']
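
To see why the two patterns behave differently, it may help to look at the match spans rather than the split result. The snippet below is only an illustrative sketch (the variable names are mine, not from the question): with the naive pattern, the trailing CRLF is matched as a lone CR plus a lone LF after backtracking, while the lookarounds rule that out.

```python
import re

text = '\r\n\r\n\r\n'

# Pattern from the question: a lone \r or \n may match inside a CRLF pair.
naive = r'(?:\r\n|\r|\n){2}'
# Pattern from this answer: \r and \n only match when not part of a CRLF.
fixed = r'(?:\r\n|\r(?!\n)|(?<!\r)\n){2}'

naive_spans = [m.span() for m in re.finditer(naive, text)]
fixed_spans = [m.span() for m in re.finditer(fixed, text)]

print(naive_spans)  # [(0, 4), (4, 6)] - the last CRLF is split into CR + LF
print(fixed_spans)  # [(0, 4)] - the last CRLF is left alone
```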

Details

  • (?:\r\n|\r(?!\n)|(?<!\r)\n){2} - a non-capturing group (if you use a capturing one, the value captured in the last iteration will also be output into the resulting list by re.split) that matches two repetitions of:
    • \r\n - a CRLF sequence
    • | - or
    • \r(?!\n) - a CR character not followed by LF
    • | - or
    • (?<!\r)\n - an LF character not preceded by CR.
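
As a quick check of the pattern on mixed line endings, here is a minimal sketch that wraps it in a helper. The names PARAGRAPH_BREAK and split_paragraphs are hypothetical, chosen just for this example:

```python
import re

# Two consecutive line breaks, where a CRLF always counts as a single break.
PARAGRAPH_BREAK = re.compile(r'(?:\r\n|\r(?!\n)|(?<!\r)\n){2}')

def split_paragraphs(text):
    """Split text on double line breaks without splitting a CRLF pair."""
    return PARAGRAPH_BREAK.split(text)

print(split_paragraphs('\r\n\r\n\r\n'))      # ['', '\r\n']
print(split_paragraphs('a\n\nb'))            # ['a', 'b']
print(split_paragraphs('a\r\n\r\nb\r\rc'))   # ['a', 'b', 'c']
print(split_paragraphs('a\r\nb'))            # ['a\r\nb'] - a single CRLF is not a double break
```

Compiling the pattern once is optional here; it just avoids repeating the regex literal at each call site.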

Upvotes: 1
