Reputation: 43426
Python 3 cleans up Python's handling of Unicode strings. I assume that, as part of this effort, the codecs in Python 3 have become more restrictive: compare the Python 3 documentation with the Python 2 documentation.
For example, codecs that conceptually convert a byte stream to a different form of byte stream (such as hex_codec, base64_codec and zlib_codec) have been removed. And codecs that conceptually convert Unicode to a different form of Unicode (such as rot_13) have also been removed; in Python 2 they actually went between Unicode and a byte stream, but conceptually it's really Unicode to Unicode, I reckon.
My main question is, what is the "right way" in Python 3 to do what these removed codecs used to do? They're not codecs in the strict sense, but "transformations". But the interface and implementation would be very similar to codecs.
I don't care about rot_13, but I'm interested to know what would be the "best way" to implement a transformation of line-ending styles (Unix line endings vs. Windows line endings), which should really be a Unicode-to-Unicode transformation done before encoding to a byte stream, especially when UTF-16 is being used, as discussed in this other SO question.
Upvotes: 3
Views: 7482
Reputation: 43426
It looks as though all these non-codec modules are being handled on a case-by-case basis. Here's what I've found so far:
- the hexlify and unhexlify functions of the binascii module (a bit of a hidden feature)

I guess that means there's no standard framework for creating such string/bytearray transformation modules, but they're being done on a case-by-case basis in Python 3.
A comment on a blog post "Compressing text using Python’s unicode support" alerted me to the fact that these codecs are back for Python 3.2.
Quoting the comment:
Since these are “text-to-text” or “binary-to-binary” transforms, though, the encode()/decode() methods in Python 3.x don’t support this style of usage (it’s a Python 2.x-only feature).
The codecs themselves are back in 3.2, but you need to go through the codecs module API in order to use them – they aren’t available via the object method shorthand.
Look in the Python 3 docs for codecs — Binary Transforms.
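To illustrate the codecs-module API on Python 3.2+ (example values mine), both a bytes-to-bytes transform and the str-to-str rot_13 codec are reachable through codecs.encode()/codecs.decode(), just not through the object-method shorthand:

```python
import codecs

# bytes-to-bytes transform, via the codecs module API
assert codecs.encode(b"hello", "hex_codec") == b"68656c6c6f"
assert codecs.decode(b"68656c6c6f", "hex_codec") == b"hello"

# str-to-str transform likewise
assert codecs.encode("foo", "rot_13") == "sbb"

# Note: 'foo'.encode('rot_13') still fails here, because str.encode()
# insists the codec return a bytes object.
```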
From a blog post by Barry Warsaw:
Did you know that Python 2 provides some codecs for doing interesting conversions such as Caesar rotation (i.e. rot13)? Thus, you can do things like:
>>> 'foo'.encode('rot-13')
'sbb'
This doesn't work in Python 3 though, because even though certain str-to-str codecs like rot-13 still exist, the str.encode() interface requires that the codec return a bytes object. In order to use str-to-str codecs in both Python 2 and Python 3, you'll have to pop the hood and use a lower-level API, getting and calling the codec directly:
>>> from codecs import getencoder
>>> encoder = getencoder('rot-13')
>>> rot13string = encoder(mystring)[0]
You have to get the zeroth element from the return value because the codecs API returns a tuple of the converted string and its length. A bit ugly, but it works in both versions of Python.
Upvotes: 6
Reputation: 21079
What specifically is your need for line-ending conversion? If it's just for writing to a file or file object, you can specify which line-ending format to use with the newline argument of open(), and \n will automatically be converted to that when you write to the file. Admittedly, this only works with files opened in text mode, not binary mode. (You can also specify which encoding to use when writing text to the file, which can be useful sometimes.)
http://docs.python.org/3.1/library/functions.html#open
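A sketch of that combination (file path mine), writing UTF-16 text with Windows line endings in one step — the text layer translates \n first, so the encoder sees the full \r\n and gives each character a proper two-byte code unit:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# newline='\r\n' translates every '\n' on write; encoding='utf-16'
# then encodes the already-converted text
with open(path, "w", encoding="utf-16", newline="\r\n") as f:
    f.write("line one\nline two\n")

# reading back in binary shows the translated, encoded line endings
with open(path, "rb") as f:
    raw = f.read()
assert "\r\n".encode("utf-16-le") in raw
```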
To do it with regular strings, you can simply do yourstring = yourstring.replace('\n', '\r\n') for conversion from Linux-style to Windows-style, and yourstring = yourstring.replace('\r\n', '\n') for conversion from Windows-style to Linux-style. You probably already know this, though, and it's probably not what you're looking for. (And, in fact, if you're writing to a text file, \n should be converted to \r\n on a Windows system anyway if universal-newline mode is enabled, which is the default.)
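One caveat worth sketching (helper name mine): going Linux-to-Windows on a string that already contains some \r\n endings would double the \r, so it's safer to normalize to \n first:

```python
def to_windows(text: str) -> str:
    # Normalize any existing \r\n (and bare \r) to \n first, then
    # expand, so mixed input doesn't end up as \r\r\n.
    return (text.replace("\r\n", "\n")
                .replace("\r", "\n")
                .replace("\n", "\r\n"))

assert to_windows("a\nb\r\nc") == "a\r\nb\r\nc"
```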
As well, if you're wanting to convert between the various Unicode encodings (assuming you're working with byte sequences, since Python's internal str type isn't tied to any particular Unicode encoding), it's just a matter of decoding the byte sequence using bytes.decode() or bytearray.decode() and then encoding using str.encode(). For a conversion from UTF-8 to UTF-16:
newstring = yourbytes.decode('utf-8')
yourbytes = newstring.encode('utf-16')
There shouldn't be any problems with newline characters not being converted properly between the two Unicode formats when done this way.
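A quick illustration of why doing the newline conversion at the str level matters for UTF-16 in particular (example data mine): each character is a two-byte code unit, so a byte-level \n → \r\n substitution on the encoded stream corrupts it, while the str-level conversion survives:

```python
text = "one\ntwo\n"

# str-level conversion first, then encode: correct
utf16 = text.replace("\n", "\r\n").encode("utf-16-le")
assert utf16.count("\r\n".encode("utf-16-le")) == 2

# byte-level substitution on the encoded stream: wrong, because
# '\n' is b'\n\x00' in UTF-16-LE, not the single byte b'\n'
broken = text.encode("utf-16-le").replace(b"\n", b"\r\n")
assert broken != utf16
```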
There are also str.translate() and str.maketrans(), though I'm not sure whether those will prove useful:
http://docs.python.org/3.1/library/stdtypes.html#str.translate
http://docs.python.org/3.1/library/stdtypes.html#str.maketrans
On a side note, rot_13 can be implemented like so:

import string
rot_13 = str.maketrans({
    x: chr((ord(x) - ord('A') + 13) % 26 + ord('A')) if x.isupper()
       else chr((ord(x) - ord('a') + 13) % 26 + ord('a'))
    for x in string.ascii_letters
})
# Using hard-coded values:
rot_13 = str.maketrans('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz', 'NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm')
Either way, using S.translate(rot_13) will turn normal strings into rot13 and rot13 strings back into normal ones.
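For instance, re-creating the hard-coded table above (example input mine):

```python
rot_13 = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",
    "NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm",
)

assert "Hello, World!".translate(rot_13) == "Uryyb, Jbeyq!"
# applying it twice is the identity, since rot13 is its own inverse
assert "Uryyb, Jbeyq!".translate(rot_13) == "Hello, World!"
```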
Upvotes: 2