Reputation: 4428
I'm digging through some old binaries that contain (among other things) text. Their text frequently uses custom character encodings for Reasons, and I want to be able to read and rewrite them.
It seems to me that the appropriate way to do this is to create a custom codec using the standard codecs library. Unfortunately its documentation is both colossal and entirely bereft of examples. Google turns up a few, but only for Python 2, and I'm using Python 3.
I'm looking for a minimal example of how to use the codecs library to implement a custom character encoding.
Upvotes: 23
Views: 5431
Reputation: 13062
You asked for minimal!
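Write an encode function and a decode function.
Write a search function that returns a CodecInfo object constructed from the above encoder and decoder.
Use codecs.register to register that search function, which returns the above CodecInfo object.
Here is an example that converts the lowercase letters a-z to 0-25 in order.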
import codecs
import string
from typing import Optional, Tuple

# prepare map from numbers (as strings) to letters (as bytes)
# note: only '0'-'9' (a-j) round-trip, because the encoder reads one character at a time
_encode_table = {str(number): bytes(letter, 'ascii') for number, letter in enumerate(string.ascii_lowercase)}

# prepare inverse map, keyed by byte value
_decode_table = {ord(v): k for k, v in _encode_table.items()}


def custom_encode(text: str, errors: str = 'strict') -> Tuple[bytes, int]:
    # example encoder that converts ints to letters
    # errors is accepted for interface compatibility but ignored here
    # see https://docs.python.org/3/library/codecs.html#codecs.Codec.encode
    return b''.join(_encode_table[x] for x in text), len(text)


def custom_decode(binary: bytes, errors: str = 'strict') -> Tuple[str, int]:
    # example decoder that converts letters to ints
    # see https://docs.python.org/3/library/codecs.html#codecs.Codec.decode
    return ''.join(_decode_table[x] for x in binary), len(binary)


def custom_search_function(encoding_name: str) -> Optional[codecs.CodecInfo]:
    # lookup names arrive lowercased; return None for every other name
    # so that lookups of unrelated encodings are not hijacked
    if encoding_name.lower() != 'reasons':
        return None
    return codecs.CodecInfo(custom_encode, custom_decode, name='Reasons')


def main():
    # register your custom codec
    # the search function decides which encoding names it answers to
    codecs.register(custom_search_function)

    binary = b'abcdefg'
    # decode letters to numbers
    text = codecs.decode(binary, encoding='Reasons')
    print(text)
    # encode numbers to letters
    binary2 = codecs.encode(text, encoding='Reasons')
    print(binary2)
    # encode(decode(...)) should be an identity function
    assert binary == binary2


if __name__ == '__main__':
    main()
Running this prints
$ python codec_example.py
0123456
b'abcdefg'
See https://docs.python.org/3/library/codecs.html#codec-objects for details on the Codec interface. In particular, the decode function
... decodes the object input and returns a tuple (output object, length consumed).
whereas the encode function
... encodes the object input and returns a tuple (output object, length consumed).
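To make the tuple contract concrete, here is a small sketch that inspects the built-in UTF-8 codec rather than a custom one. The point it illustrates is that "length consumed" counts units of the input (characters when encoding, bytes when decoding), not units of the output. The variable names are illustrative only.

```python
import codecs

# Look up a built-in codec to inspect the (output, length consumed) contract.
info = codecs.lookup('utf-8')

# Encoding: the input is a str, so "length consumed" counts characters.
# 'h\u00e9llo' has 5 characters, one of which needs 2 bytes in UTF-8.
encoded, chars_consumed = info.encode('h\u00e9llo')

# Decoding: the input is bytes, so "length consumed" counts bytes.
decoded, bytes_consumed = info.decode(encoded)

print(encoded, chars_consumed)  # b'h\xc3\xa9llo' 5
print(bytes_consumed)           # 6
```

A custom encoder or decoder should follow the same convention, which is why returning len(text) from the encoder and len(binary) from the decoder is the right thing when the whole input is always processed.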
Note that you should also worry about handling streams, incremental encoding/decoding, as well as error handling. For a more complete example, refer to the hexlify codec that @krs013 mentioned.
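As a rough illustration of the error-handling part, here is a sketch of stateless encode and decode functions that honor the errors argument. Everything in it is an assumption made for the example: the codec name 'reasons_errors', the one-byte table (byte values 0..25 standing for 'a'..'z'), and the decision to support only 'strict', 'ignore', and (for decoding) 'replace'. A real codec may want to support more handlers.

```python
import codecs
import string

# Hypothetical one-byte encoding: byte values 0..25 stand for 'a'..'z'.
_DECODE = dict(enumerate(string.ascii_lowercase))
_ENCODE = {letter: number for number, letter in _DECODE.items()}


def encode(text, errors='strict'):
    out = bytearray()
    for pos, char in enumerate(text):
        if char in _ENCODE:
            out.append(_ENCODE[char])
        elif errors != 'ignore':
            # Anything other than 'ignore' is treated as 'strict' in this sketch.
            raise UnicodeEncodeError('reasons_errors', text, pos, pos + 1,
                                     'character not in table')
    return bytes(out), len(text)


def decode(data, errors='strict'):
    data = bytes(data)
    chars = []
    for pos, byte in enumerate(data):
        if byte in _DECODE:
            chars.append(_DECODE[byte])
        elif errors == 'replace':
            chars.append('\ufffd')  # U+FFFD REPLACEMENT CHARACTER
        elif errors != 'ignore':
            raise UnicodeDecodeError('reasons_errors', data, pos, pos + 1,
                                     'byte not in table')
    return ''.join(chars), len(data)


def search(name):
    # Only answer for our own (hypothetical) codec name.
    if name == 'reasons_errors':
        return codecs.CodecInfo(encode, decode, name='reasons_errors')
    return None


codecs.register(search)

print(codecs.encode('abc', 'reasons_errors'))  # b'\x00\x01\x02'
print(codecs.decode(b'\x00\xff\x02', 'reasons_errors', errors='ignore'))  # ac
```

Raising UnicodeEncodeError and UnicodeDecodeError, rather than letting a KeyError escape, lets callers catch the same exceptions they would for any built-in encoding.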
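P.S. Instead of codecs.decode, you can also use codecs.open(..., encoding='Reasons'), although that relies on stream reader and writer classes, which the minimal CodecInfo above does not provide.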
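For stream-based access, the CodecInfo needs streamreader and streamwriter classes. The following is a minimal sketch of how those might be supplied, assuming a hypothetical codec name 'reasons_stream' and a hypothetical one-byte table (byte values 0..25 for 'a'..'z'). It uses codecs.getwriter and codecs.getreader on in-memory buffers instead of a real file, and it handles errors only in the crudest way (unmapped input raises KeyError).

```python
import codecs
import io
import string

# Hypothetical one-byte encoding: byte values 0..25 stand for 'a'..'z'.
_DECODE = dict(enumerate(string.ascii_lowercase))
_ENCODE = {letter: number for number, letter in _DECODE.items()}


class ReasonsCodec(codecs.Codec):
    # errors is accepted but ignored in this sketch.
    def encode(self, text, errors='strict'):
        return bytes(_ENCODE[c] for c in text), len(text)

    def decode(self, data, errors='strict'):
        return ''.join(_DECODE[b] for b in bytes(data)), len(data)


# The stream classes inherit encode/decode from ReasonsCodec and the
# buffering logic from the codecs base classes.
class ReasonsStreamWriter(ReasonsCodec, codecs.StreamWriter):
    pass


class ReasonsStreamReader(ReasonsCodec, codecs.StreamReader):
    pass


def search(name):
    if name != 'reasons_stream':
        return None
    codec = ReasonsCodec()
    return codecs.CodecInfo(
        name='reasons_stream',
        encode=codec.encode,
        decode=codec.decode,
        streamreader=ReasonsStreamReader,
        streamwriter=ReasonsStreamWriter,
    )


codecs.register(search)

# Write text through the stream writer into an in-memory byte buffer.
buffer = io.BytesIO()
writer = codecs.getwriter('reasons_stream')(buffer)
writer.write('abc')
raw = buffer.getvalue()

# Read the bytes back through the stream reader.
reader = codecs.getreader('reasons_stream')(io.BytesIO(raw))
text = reader.read()

print(raw, text)  # b'\x00\x01\x02' abc
```

Because each character maps to exactly one byte here, the stream classes need no extra state; a multi-byte encoding would have to report fewer bytes consumed when a chunk ends mid-character.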
Upvotes: 19
Reputation: 2949
While the online documentation is certainly sparse, you can get a lot more information by looking at the source code. The docstrings and comments are quite clear, and the definitions for the parent classes (Codec, IncrementalEncoder, etc.) are ready to be copy/pasted for a start to your codec (be sure to replace the object in each class definition with the name of the class you're inheriting from). It's also worth looking at the example I linked to in the comments for how to assemble/register it.
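As a rough idea of what the class-based approach can look like, here is a sketch that subclasses codecs.Codec, codecs.IncrementalEncoder, and codecs.IncrementalDecoder and then uses the codec with the built-in open(). The codec name 'reasons_incremental', the class names, and the one-byte table (byte values 0..25 for 'a'..'z') are all hypothetical, and error handling is left out (unmapped input raises KeyError).

```python
import codecs
import os
import string
import tempfile

# Hypothetical one-byte encoding: byte values 0..25 stand for 'a'..'z'.
_DECODE = dict(enumerate(string.ascii_lowercase))
_ENCODE = {letter: number for number, letter in _DECODE.items()}


class ReasonsCodec(codecs.Codec):
    # Stateless interface: both methods return (output, length consumed).
    def encode(self, text, errors='strict'):
        return bytes(_ENCODE[c] for c in text), len(text)

    def decode(self, data, errors='strict'):
        return ''.join(_DECODE[b] for b in bytes(data)), len(data)


class ReasonsIncrementalEncoder(codecs.IncrementalEncoder):
    # Incremental interface: returns only the output, not a tuple.
    def encode(self, text, final=False):
        return ReasonsCodec().encode(text, self.errors)[0]


class ReasonsIncrementalDecoder(codecs.IncrementalDecoder):
    # One byte per character, so no partial input needs to be buffered.
    def decode(self, data, final=False):
        return ReasonsCodec().decode(data, self.errors)[0]


def search(name):
    if name != 'reasons_incremental':
        return None
    codec = ReasonsCodec()
    return codecs.CodecInfo(
        name='reasons_incremental',
        encode=codec.encode,
        decode=codec.decode,
        incrementalencoder=ReasonsIncrementalEncoder,
        incrementaldecoder=ReasonsIncrementalDecoder,
    )


codecs.register(search)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'sample.bin')
    # Text-mode open() drives the incremental encoder and decoder.
    with open(path, 'w', encoding='reasons_incremental') as handle:
        handle.write('hello')
    with open(path, 'rb') as handle:
        raw = handle.read()
    with open(path, encoding='reasons_incremental') as handle:
        text = handle.read()

print(raw, text)
```

The stateless Codec methods are reused by the incremental classes, which keeps the table lookup in one place.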
I've been stuck at the same point as you for a while looking through this, so good luck! If I have time in a few days, I'll see about actually making that implementation and pasting/linking to it here.
Upvotes: 2