Gregor

Reputation: 23

Test for email charsets with Python

I want to import emails from an mbox format into a Django app. All database tables are Unicode. My problem: sometimes the wrong charset is given, sometimes none at all. What is the best way to deal with these encoding issues?

So far, I just nest try/except blocks to try the two charsets I most often receive mail in (utf-8 and iso-8859-1):

    if not message.is_multipart():
        message_charset = message.get_content_charset()
        msg.message = message_charset + unicode(message.get_payload(decode=False), message_charset)
    else:
        for part in message.walk():
            if part.get_content_type() == "text/plain":
                message_charset = part.get_content_charset()
                try:
                    msg.message = message_charset + unicode(part.get_payload(decode=False), message_charset)
                except UnicodeDecodeError:
                    try:
                        msg.message = message_charset + unicode(part.get_payload(decode=False), "utf-8")
                    except UnicodeDecodeError:
                        msg.message = message_charset + unicode(part.get_payload(decode=False), "iso-8859-1")

Is there a better, more robust way?

Thanks!

Upvotes: 2

Views: 476

Answers (2)

John Machin

Reputation: 82934

I'm sorry but your strategy is WRONG.

Firstly, there are encodings that were deliberately designed to fly under the 7-bit ASCII radar so that they could be used in early email systems. The Chinese HZ encoding is little used these days but Japanese email seems to use ISO-2022-JP quite frequently. Both of those would be wrongly interpreted as ASCII if you tried that first; your current strategy would wrongly interpret them as UTF-8. It would also interpret restricted (all chars < U+0080) UTF-16 text as UTF-8.
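
A quick way to see this for yourself, as a sketch in Python 3 syntax (the sample strings are only illustrations):

```python
# ISO-2022-JP uses only 7-bit bytes plus escape sequences, so it passes
# both ASCII and UTF-8 decoding without raising an error.
japanese = "日本語のメール"
raw = japanese.encode("iso2022_jp")

assert all(b < 0x80 for b in raw)     # every byte is in the 7-bit range

as_utf8 = raw.decode("utf-8")         # no UnicodeDecodeError, but wrong text
as_ascii = raw.decode("ascii")        # succeeds silently as well
correct = raw.decode("iso2022_jp")    # the text that was actually sent

# Restricted UTF-16 text also decodes as UTF-8, with NULs interleaved.
utf16 = "abc".encode("utf-16-le")
utf16_as_utf8 = utf16.decode("utf-8")
```

Neither wrong decode raises, so a try/except chain never gets the chance to move on to the right codec.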

Secondly, ISO-8859-1 maps every one of the 256 possible byte values to a Unicode character. random_garbage.decode('iso-8859-1') will never raise an exception. In other words, anything that fails the UTF-8 test will be interpreted as 'ISO-8859-1' by your strategy.
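
The same point as a small check, again sketched in Python 3 syntax (the variable names are mine):

```python
import os

# Each of the 256 byte values maps to the code point with the same number,
# so there is no input that ISO-8859-1 can reject.
every_byte = bytes(range(256))
decoded = every_byte.decode("iso-8859-1")

# Arbitrary bytes therefore always "decode", whatever they really were.
random_garbage = os.urandom(64)
text = random_garbage.decode("iso-8859-1")   # never raises
```

This is why ISO-8859-1 cannot serve as a test at the end of a fallback chain: it accepts everything.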

Do what the man said: use chardet right from the start. It knows in what order the tests should be done.
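
A minimal sketch of what that could look like, in Python 3 syntax. The helper name decode_bytes and its parameters are hypothetical; the only chardet call assumed is chardet.detect(), which takes bytes and returns a dict with an "encoding" key. The detector is passed in as an argument so the sketch can be tried without chardet installed.

```python
def decode_bytes(raw, declared=None, detect=None):
    """Decode raw bytes to text using a detector's guess first.

    `detect` is any callable that returns a dict with an "encoding" key,
    which is the shape chardet.detect() returns. `declared` is the charset
    from the message headers, used only if the guess is missing or fails.
    """
    if detect is None:
        import chardet  # third-party package, assumed installed
        detect = chardet.detect

    guess = detect(raw).get("encoding")
    for charset in (guess, declared):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong charset, or one Python does not know

    # Last resort: keep what is readable, mark the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

With the code from the question, raw would come from part.get_payload(decode=True), which undoes the base64 or quoted-printable transfer encoding and returns bytes, and declared from part.get_content_charset(). Whether the declared charset deserves more trust than the guess is a judgment call that this sketch does not settle.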

Upvotes: 0

RichieHindle

Reputation: 281515

You could ask the excellent chardet library to guess the encoding.

"Character encoding auto-detection in Python 2 and 3. As smart as your browser. Open source."

Upvotes: 1
