Test for email charsets with Python

Question

I want to import emails from an mbox format into a Django app. All database tables are Unicode. My problem: sometimes the wrong charset is given, sometimes none at all. What is the best way to deal with these encoding issues?

So far, I merely nest exceptions to try the two most common charsets I receive mails in (utf-8 and iso-8859-1):

    if not message.is_multipart():
        message_charset = message.get_content_charset()
        msg.message = message_charset + unicode(message.get_payload(decode=False), message_charset)
    else:
        for part in message.walk():
            if part.get_content_type() == "text/plain":
                message_charset = part.get_content_charset()
                try:
                    msg.message = message_charset + unicode(part.get_payload(decode=False), message_charset)
                except UnicodeDecodeError:
                    try:
                        msg.message = message_charset + unicode(part.get_payload(decode=False), "utf-8")
                    except UnicodeDecodeError:
                        msg.message = message_charset + unicode(part.get_payload(decode=False), "iso-8859-1")

Is there a better, more robust way?

Thanks!

RichieHindle · Accepted Answer

You could ask the excellent chardet library to guess the encoding.

"Character encoding auto-detection in Python 2 and 3. As smart as your browser. Open source."
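
As a rough sketch of how that could fit together (not a definitive implementation): try the declared charset first, then utf-8, then chardet's guess, and finally iso-8859-1, which accepts any byte sequence. This assumes Python 3, whereas the question's code is Python 2. The helper names `_try_decode`, `decode_payload` and `extract_text` are made up for illustration; `chardet.detect` is the library's detection call and returns a dict whose `"encoding"` entry may be `None`. The import is wrapped so the sketch still runs when chardet is not installed.

```python
import email

try:
    import chardet  # third-party package; treated as optional in this sketch
except ImportError:
    chardet = None


def _try_decode(raw, charset):
    """Return the decoded text, or None if the charset is missing, unknown or wrong."""
    if not charset:
        return None
    try:
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers charset names Python does not recognise.
        return None


def decode_payload(raw, declared_charset=None):
    """Decode raw bytes, trying progressively less trusted charsets."""
    for charset in (declared_charset, "utf-8"):
        text = _try_decode(raw, charset)
        if text is not None:
            return text
    if chardet is not None:
        # Only ask chardet once the cheap candidates have failed.
        text = _try_decode(raw, chardet.detect(raw)["encoding"])
        if text is not None:
            return text
    # iso-8859-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("iso-8859-1")


def extract_text(message):
    """Decode the first text/plain part of an email.message.Message."""
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes base64 / quoted-printable and returns bytes.
            raw = part.get_payload(decode=True)
            return decode_payload(raw or b"", part.get_content_charset())
    return None
```

Because `walk()` also yields the message itself when it is not multipart, the same loop handles both branches of the original code. Note that a wrongly declared charset that happens to decode without error (iso-8859-1, for example) is still accepted as-is, so a guess is only made when decoding actually fails.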
