Niklas9

Reputation: 9396

Python string splitlines() removes certain Unicode control characters

I noticed that Python's standard string method splitlines() also removes some crucial Unicode control characters. Example:

>>> s1 = u'asdf \n fdsa \x1d asdf'
>>> s1.splitlines()
[u'asdf ', u' fdsa ', u' asdf']

Notice how the "\x1d" character quietly disappears.

It doesn't happen if the string is a Python bytestring (without the "u" prefix):

>>> s2 = 'asdf \n fdsa \x1d asdf'
>>> s2.splitlines()
['asdf ', ' fdsa \x1d asdf']

I can't find any information about this in the reference https://docs.python.org/2.7/library/stdtypes.html#str.splitlines.

Why does this happen? Which characters other than "\x1d" (or unichr(29)) are affected?

I'm using Python 2.7.3 on Ubuntu 12.04 LTS.

Upvotes: 7

Views: 2578

Answers (1)

Martijn Pieters

Reputation: 1122382

This is indeed under-documented; I had to dig through the source code somewhat to find it.

The unicodetype_db.h file defines linebreaks as:

case 0x000A:
case 0x000B:
case 0x000C:
case 0x000D:
case 0x001C:
case 0x001D:
case 0x001E:
case 0x0085:
case 0x2028:
case 0x2029:

These are generated from the Unicode database; any codepoint listed in the Unicode standard with the Line_Break property set to BK, CR, LF or NL or with bidirectional category set to B (paragraph break) is considered a line break.
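
One way to double-check that list is to ask the interpreter directly. The sketch below assumes Python 3, where str is always Unicode (the question uses Python 2.7, so this is not the asker's setup), and the helper name find_linebreak_codepoints is made up for illustration. It places each codepoint between two letters and checks whether splitlines() returns two pieces:

```python
import sys

def find_linebreak_codepoints():
    """Return every codepoint that str.splitlines() treats as a line break."""
    found = []
    for cp in range(sys.maxunicode + 1):
        # A break character between 'a' and 'b' yields two pieces.
        if len(('a' + chr(cp) + 'b').splitlines()) == 2:
            found.append(cp)
    return found

print(['U+%04X' % cp for cp in find_linebreak_codepoints()])
```

On CPython 3 this should report the same ten codepoints as the case list above, since splitlines() consults the same generated table.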

From the Unicode Data file, version 6 of the standard lists U+001D as a paragraph break:

001D;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR THREE;;;;

(5th column is the bidirectional category).
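
The same field can be read from Python without opening the data file. This is a small sketch, again assuming Python 3; unicodedata.bidirectional() is the standard library call that returns this column, and the choice of the four information separators as sample input is mine:

```python
import unicodedata

# Print the bidirectional category of each information separator.
for cp in (0x1C, 0x1D, 0x1E, 0x1F):
    ch = chr(cp)
    print('U+%04X %s' % (cp, unicodedata.bidirectional(ch)))
```

U+001C through U+001E report B (paragraph separator), while U+001F reports S (segment separator), which is consistent with U+001F being absent from the line break list.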

You could use a regular expression if you want to limit what characters to split on:

import re

linebreaks = re.compile(ur'[\n-\r\x85\u2028\u2029]')
linebreaks.split(yourtext)

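would split your text on the same set of line breaks except for the U+001C, U+001D and U+001E codepoints, that is, the three information separator control characters. Unlike splitlines(), it treats a \r\n pair as two separate breaks, producing an empty string between them, and it keeps a trailing empty string when the text ends with a line break.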
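
If you want behaviour closer to splitlines() itself, a variant along these lines may help. It is only a sketch, assuming Python 3; the function name split_text_lines is hypothetical, and the pattern is my adaptation of the regular expression above, with \r\n listed first so that a CRLF pair counts as one break:

```python
import re

# Same break set as str.splitlines(), minus U+001C, U+001D and U+001E.
# '\r\n' comes first so a CRLF pair is consumed as a single break.
_LINEBREAKS = re.compile('\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]')

def split_text_lines(text):
    """Split on line breaks but leave the information separators in place."""
    parts = _LINEBREAKS.split(text)
    # str.splitlines() does not report an empty final line; mirror that.
    if parts and parts[-1] == '':
        parts.pop()
    return parts

print(split_text_lines('asdf \n fdsa \x1d asdf'))
```

With the string from the question, the "\x1d" character stays inside the second element instead of acting as a separator.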

Upvotes: 13
