Reputation: 9396
I noticed that Python's standard string method splitlines() also splits on, and so removes, some crucial Unicode control characters. Example:
>>> s1 = u'asdf \n fdsa \x1d asdf'
>>> s1.splitlines()
[u'asdf ', u' fdsa ', u' asdf']
Notice how the "\x1d" character quietly disappears.
It doesn't happen if the string s1 is still a Python bytestring though (without the "u" prefix):
>>> s2 = 'asdf \n fdsa \x1d asdf'
>>> s2.splitlines()
['asdf ', ' fdsa \x1d asdf']
I can't find any information about this in the reference https://docs.python.org/2.7/library/stdtypes.html#str.splitlines.
Why does this happen? Which characters other than "\x1d" (or unichr(29)) are affected?
I'm using Python 2.7.3 on Ubuntu 12.04 LTS.
Upvotes: 7
Views: 2578
Reputation: 1122382
This is indeed under-documented; I had to dig through the source code somewhat to find it.
The unicodetype_db.h file defines linebreaks as:
case 0x000A:
case 0x000B:
case 0x000C:
case 0x000D:
case 0x001C:
case 0x001D:
case 0x001E:
case 0x0085:
case 0x2028:
case 0x2029:
These are generated from the Unicode database; any codepoint listed in the Unicode standard with the Line_Break property set to BK, CR, LF or NL, or with bidirectional category set to B (paragraph break), is considered a line break.
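If you would rather check this against your own interpreter than read the generated header, a brute-force probe is one option. The sketch below is only an illustration: the helper name linebreak_codepoints is made up for this example, it scans just the Basic Multilingual Plane, and the unichr shim is an assumption so that the same code runs on Python 3 as well as 2.7.

```python
try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3: chr() already returns a Unicode character


def linebreak_codepoints(limit=0x10000):
    """Return the codepoints below `limit` that splitlines() splits on."""
    found = []
    for cp in range(limit):
        # Wrap the candidate in ordinary letters; more than one piece
        # means splitlines() treated the candidate as a line boundary.
        if len((u'a' + unichr(cp) + u'b').splitlines()) > 1:
            found.append(cp)
    return found


print(['U+%04X' % cp for cp in linebreak_codepoints()])
```

On an interpreter whose Unicode tables match the case list above, this should report the same ten codepoints.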
From the Unicode Data file, version 6 of the standard lists U+001D as a paragraph break:
001D;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR THREE;;;;
(5th column is the bidirectional category).
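The same field can be read from Python through the standard unicodedata module, which saves a trip to the data file. A minimal sketch, with U+001F added only as a contrast case (it is a segment separator, not a paragraph separator):

```python
import unicodedata

# Print the bidirectional category of the information separators.
for ch in (u'\x1c', u'\x1d', u'\x1e', u'\x1f'):
    print('U+%04X %s' % (ord(ch), unicodedata.bidirectional(ch)))
```

U+001D comes back as B, which is why splitlines() splits on it, while U+001F comes back as S and is left alone.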
You could use a regular expression if you want to limit what characters to split on:
import re
linebreaks = re.compile(ur'[\n-\r\x85\u2028\u2029]')
linebreaks.split(yourtext)
This splits your text on the same set of line breaks as splitlines(), except for the U+001C, U+001D and U+001E codepoints, the three data-structuring control characters.
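One caveat with a plain character class: re.split() treats "\r\n" as two separators and leaves an empty string after a trailing line break, neither of which splitlines() does. The sketch below is one way to get closer to splitlines() behaviour while keeping the information separators; the function name split_lines_keep_separators is hypothetical, and the pattern is written as a non-raw u'' literal (an assumption made so it also runs on Python 3, where ur'' is not valid syntax).

```python
import re

# "\r\n" is tried first so a Windows line ending counts as one break.
# The class covers U+000A..U+000D, U+0085, U+2028 and U+2029, but
# deliberately not U+001C, U+001D or U+001E.
LINEBREAKS = re.compile(u'\r\n|[\n-\r\x85\u2028\u2029]')


def split_lines_keep_separators(text):
    """Split on line breaks, leaving U+001C..U+001E in the output."""
    parts = LINEBREAKS.split(text)
    # Like splitlines(), do not report an empty final line.
    if parts and parts[-1] == u'':
        parts.pop()
    return parts


s1 = u'asdf \n fdsa \x1d asdf'
print(split_lines_keep_separators(s1))
```

With the string from the question this yields two pieces, and the "\x1d" character stays inside the second one.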
Upvotes: 13