Decoding Mail Subject Thunderbird in Python 3.x

Question

For a workaround, see below

/Original Question:

Sorry, I am simply too dumb to solve this on my own. I am trying to read the "subjects" from several emails stored in a .mbox folder from Thunderbird. Now, I am trying to decode the header with decode_header(), but I am still getting UnicodeErrors.

I am using the following function (I am sure there is a smarter way to do this, but that is not the point of this post):

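import mailbox
from email.header import decode_header


def header_to_string(header):
    """Decode the first part of a raw header value into a str."""
    try:
        # decode_header() returns a list of (value, charset) pairs;
        # only the first pair is used here
        value, encoding = decode_header(header)[0]
    except Exception:
        return "something went wrong {}".format(header)
    if isinstance(value, bytes):
        # Encoded parts come back as bytes; assume ASCII if no charset is given
        return value.decode(encoding or "ascii", errors="replace")
    return value


mflder = mailbox.mbox("mailfolder")

# The function must be defined before this loop runs
for message in mflder:
    print(header_to_string(message["subject"]))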

The first 100 outputs or so are perfectly fine, but then this error message appears:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
 in 
----> 1 for message in mflder:
      2     try:
      3         print(header_to_string(message["subject"]))
      4     except:
      5         print("0")

~\anaconda3\lib\mailbox.py in itervalues(self)
    107         for key in self.iterkeys():
    108             try:
--> 109                 value = self[key]
    110             except KeyError:
    111                 continue

~\anaconda3\lib\mailbox.py in __getitem__(self, key)
     71         """Return the keyed message; raise KeyError if it doesn't exist."""
     72         if not self._factory:
---> 73             return self.get_message(key)
     74         else:
     75             with contextlib.closing(self.get_file(key)) as file:

~\anaconda3\lib\mailbox.py in get_message(self, key)
    779         string = self._file.read(stop - self._file.tell())
    780         msg = self._message_factory(string.replace(linesep, b'\n'))
--> 781         msg.set_from(from_line[5:].decode('ascii'))
    782         return msg
    783 

UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 4: ordinal not in range(128)

How can I force mailbox.py to decode a different encoding? Or is the header simply broken? And if I understood this correctly, headers are supposed to be "ASCII", right? I mean, this is the point of this entire MIME thing, no?

Thanks for your help!

/Workaround

I found a workaround: avoid iterating directly over the .mbox mailfolder representation. Instead of using ...

for message in mflder:
    # do something

... simply use:

for x in range(len(mflder)):
    try:
        # Loading the message is what raises the UnicodeDecodeError
        message = mflder[x]
        print(header_to_string(message["subject"]))
    except Exception:
        print("Failed loading message!")
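If the broken messages should be kept rather than skipped, one possible variation is to read each message as raw bytes and parse it with the email package. This is only a sketch, under the assumption that the non-ASCII byte sits in the mbox "From " separator line (which is what the traceback suggests). The helper name iter_messages_from_bytes is made up for this example; get_bytes() and iterkeys() are regular mailbox methods.

```python
import email
import mailbox


def iter_messages_from_bytes(path):
    """Yield every message in an mbox file, parsed from its raw bytes.

    Assumption: the failure comes only from the mbox "From " separator
    line. get_bytes() leaves that line out by default, so it is never
    decoded as ASCII.
    """
    box = mailbox.mbox(path)
    try:
        for key in box.iterkeys():
            raw = box.get_bytes(key)  # message bytes without the "From " line
            yield email.message_from_bytes(raw)
    finally:
        box.close()
```

Looping over iter_messages_from_bytes("mailfolder") and reading message["subject"] would then replace the loop above. Note that the yielded objects are plain email.message.Message instances, not mailbox.mboxMessage, so mbox-specific methods such as get_from() are not available on them.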

The range-based loop skips the broken messages in the .mbox folder. However, I ran into several other issues while working with the subjects. For instance, decode_header() sometimes splits a header into several tuples, so to get the full subject one has to extend header_to_string() as well. But this is no longer related to this question. I am a noob and a hobby programmer, but I remember working with the Outlook API and Python, which was MUCH easier...
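For the split-header issue, a minimal sketch of one way to get the full subject: pass everything decode_header() returns to make_header() (both from email.header) and convert the result to a string, instead of decoding only the first tuple. The handling of a missing header (returning an empty string) is an assumption made for this example.

```python
from email.header import decode_header, make_header


def header_to_string(header):
    """Join all decoded parts of a header into a single str."""
    if header is None:
        # Assumption for this sketch: a missing header becomes an empty string
        return ""
    # decode_header() may return several (value, charset) pairs;
    # make_header() combines them into one Header object
    return str(make_header(decode_header(header)))
```

A header with an unknown or wrong charset can still raise an exception here, so the try/except around the call remains useful.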

Answers (1)

Solution

Previous answer that didn't solve the original problem but still applies