Brandon Nadeau

Reputation: 3716

Python Encoding: Open/Read Image File, Decode Image, RE-Encode Image

Note: I don't know much about Encoding / Decoding, but after I ran into this problem, those words are now complete jargon to me.

Question: I'm a little confused here. I was playing around with encoding/decoding images in order to store an image as a TextField in a Django model. Looking around Stack Overflow, I found I could decode an image from ASCII (I think, or binary? Whatever open('file', 'rb') uses as its encoding; I'm assuming the default, ASCII) to latin1 and store it in a database with no problems.

The problem comes from creating the image from the latin1 decoded data. When attempting to write to a file-handle I get a UnicodeEncodeError saying ascii encoding failed.

I think the problem is that when a file is opened as binary data (rb), its contents are not properly ASCII-encoded, because they are binary data. I then decode the binary data to latin1, but when converting back to ASCII (which happens automatically when writing to the file), it fails for some unknown reason.

My guess is one of two things. Either decoding to latin1 converts the raw binary data into something else, and encoding back to ASCII can no longer identify what was once raw binary data (although the original and decoded data have the same length). Or the problem lies not with the decoding to latin1 but with my attempt to ASCII-encode binary data. In that case, how would I encode the latin1 data back to an image?

I know this is very confusing, but I'm confused by it all, so I can't explain it well. If anyone can answer this question, they're probably a riddle master.

Some code to visualize:

>>> image_handle = open('test_image.jpg', 'rb')
>>> 
>>> raw_image_data = image_handle.read()
>>> latin_image_data = raw_image_data.decode('latin1')
>>> 
>>> 
>>> # The raw data can't be processed by django 
... # but in `latin1` it works
>>> 
>>> # Analysis of the data
>>> 
>>> type(raw_image_data), len(raw_image_data)
(<type 'str'>, 2383864)
>>> 
>>> type(latin_image_data), len(latin_image_data)
(<type 'unicode'>, 2383864)
>>> 
>>> len(raw_image_data) == len(latin_image_data)
True
>>> 
>>> 
>>> # How to write back to as a file?
>>> 
>>> copy_image_handle = open('new_test_image.jpg', 'wb')
>>> 
>>> copy_image_handle.write(raw_image_data)
>>> copy_image_handle.close()
>>> 
>>> 
>>> copy_image_handle = open('new_test_image.jpg', 'wb')
>>> 
>>> copy_image_handle.write(latin_image_data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
>>> 
>>> 
>>> latin_image_data.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
>>> 
>>> 
>>> latin_image_data.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

Upvotes: 1

Views: 13973

Answers (2)

Brandon Nadeau

Reputation: 3716

Unlike normal/plain text files, an image file does not have any text encoding; what you see when you open it as text is just a visual representation of the image's binary data. Like @cameron-f says in the question comments, this is basically gibberish, and any encoding done will break the image file, so don't try it.

But that doesn't mean all hope is lost. Here's the way I usually turn an image into a string and back into an image.

import zlib
from base64 import b64decode, b64encode

# Read the raw bytes of the image
image_handle = open('test_image.jpg', 'rb')
raw_image_data = image_handle.read()
image_handle.close()

# Base64-encode the bytes, then compress the encoded string
encoded_data = b64encode(raw_image_data)
compressed_data = zlib.compress(encoded_data, 9)

# Reverse the steps: decompress, then Base64-decode
uncompressed_data = zlib.decompress(compressed_data)
decoded_data = b64decode(uncompressed_data)

# Write the recovered bytes to a new image file
new_image_handle = open('new_test_image.jpg', 'wb')
new_image_handle.write(decoded_data)
new_image_handle.close()


# Data Types && Data Size Analysis
type(raw_image_data), len(raw_image_data)
>>> (<type 'str'>, 2383864)

type(encoded_data), len(encoded_data)
>>> (<type 'str'>, 3178488)

type(compressed_data), len(compressed_data)
>>> (<type 'str'>, 2189311)

type(uncompressed_data), len(uncompressed_data)
>>> (<type 'str'>, 3178488)

type(decoded_data), len(decoded_data)
>>> (<type 'str'>, 2383864)


# Showing that the conversions were successful
decoded_data == raw_image_data
>>> True

encoded_data == uncompressed_data
>>> True
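
For readers on Python 3, where bytes and str are separate types, a similar round trip might look like the sketch below. The helper names (image_bytes_to_text, text_to_image_bytes) and the sample bytes are made up for illustration and stand in for a real image file. Note that this sketch compresses first and Base64-encodes second, the reverse of the order above; that is my own assumption about what suits a text column, since the final value is then plain ASCII text rather than compressed binary.

```python
import base64
import zlib


def image_bytes_to_text(data: bytes) -> str:
    """Compress the bytes, then Base64-encode them into an ASCII string."""
    compressed = zlib.compress(data, 9)
    return base64.b64encode(compressed).decode("ascii")


def text_to_image_bytes(text: str) -> bytes:
    """Reverse image_bytes_to_text: Base64-decode, then decompress."""
    compressed = base64.b64decode(text)
    return zlib.decompress(compressed)


# Stand-in for the contents of an image file (hypothetical sample data).
sample = bytes(range(256)) * 64

stored = image_bytes_to_text(sample)    # safe to keep in a text field
restored = text_to_image_bytes(stored)  # identical to the original bytes

print(type(stored).__name__, restored == sample)
```

With a real file, sample would instead come from open('test_image.jpg', 'rb').read(), and restored would be written out with a handle opened in 'wb' mode.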

Upvotes: 4

cameron-f

Reputation: 431

The UnicodeEncodeError is popping up because a jpeg is a binary file and ASCII encoding is for plain text in plain text files.
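
To make that concrete, here is a small sketch in Python 3 (the question's session is Python 2, where the write call does the ASCII encode implicitly; here it is spelled out). The sample bytes are an assumption standing in for the image data. It shows that ASCII cannot represent bytes above 127, while encoding with the same codec that was used to decode (latin1) gives the original bytes back.

```python
# Stand-in for image data: every possible byte value once (an assumption,
# not the asker's actual file).
raw_bytes = bytes(range(256))

# latin1 maps each byte 0-255 to the code point with the same number,
# so decoding never fails and the length is preserved.
as_text = raw_bytes.decode("latin1")

# ASCII only covers 0-127, so encoding the same text as ASCII fails.
try:
    as_text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding with the codec that was used to decode restores the bytes.
restored = as_text.encode("latin1")

print(ascii_failed, restored == raw_bytes)
```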

Plain text files can be created with generic text editors like Notepad on Windows or nano on Linux. Most will use either ASCII or a Unicode encoding. When a text editor reads an ASCII file, it grabs a byte, say 01100001 (97 in decimal), and looks up the corresponding glyph, 'a'.

So when a text editor tries to read a JPEG, it will grab the same byte 01100001 and get 'a', but since the file holds information for displaying a photo, the text will just be gibberish. Try opening the JPEG in Notepad or nano.
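
A rough sketch of both points in Python 3: the byte-to-glyph lookup, and what the start of a JPEG looks like when treated as text. The jpeg_start bytes are an illustrative JFIF-style header, not taken from the asker's file, and the dot substitution is just one way an editor might render unprintable bytes.

```python
# One byte, read the way an ASCII-aware text editor would read it.
byte_value = 0b01100001
glyph = bytes([byte_value]).decode("ascii")
print(byte_value, glyph)  # 97 a

# The first bytes of a typical JFIF-style JPEG (illustrative sample).
jpeg_start = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# Show each byte as a character where it is printable ASCII, else '.'.
as_editor_text = "".join(chr(b) if 32 <= b < 127 else "." for b in jpeg_start)
print(as_editor_text)  # ......JFIF
```

The first four bytes here are all above 127, which lines up with the "position 0-3" reported in the question's traceback.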

As for encoding here is an explanation: What is the difference between encode/decode?

Upvotes: 1
