Reputation: 60
For a workaround, see below
/Original Question:
Sorry, I can't solve this on my own. I am trying to read the subjects of several emails stored in a Thunderbird .mbox folder. I decode each header with decode_header(), but I am still getting UnicodeErrors.
I am using the following function (I am sure there is a smarter way to do this, but that is not the point of this post):
import mailbox
from email.header import decode_header

def header_to_string(header):
    try:
        header, encoding = decode_header(header)[0]
    except Exception:
        return "something went wrong {}".format(header)
    if encoding is None:
        return header
    else:
        return header.decode(encoding)

mflder = mailbox.mbox("mailfolder")
for message in mflder:
    print(header_to_string(message["subject"]))
The first 100 outputs or so are perfectly fine, but then this error message appears:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-97-e252df04c215> in <module>
----> 1 for message in mflder:
2 try:
3 print(header_to_string(message["subject"]))
4 except:
5 print("0")
~\anaconda3\lib\mailbox.py in itervalues(self)
107 for key in self.iterkeys():
108 try:
--> 109 value = self[key]
110 except KeyError:
111 continue
~\anaconda3\lib\mailbox.py in __getitem__(self, key)
71 """Return the keyed message; raise KeyError if it doesn't exist."""
72 if not self._factory:
---> 73 return self.get_message(key)
74 else:
75 with contextlib.closing(self.get_file(key)) as file:
~\anaconda3\lib\mailbox.py in get_message(self, key)
779 string = self._file.read(stop - self._file.tell())
780 msg = self._message_factory(string.replace(linesep, b'\n'))
--> 781 msg.set_from(from_line[5:].decode('ascii'))
782 return msg
783
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 4: ordinal not in range(128)
How can I force mailbox.py to decode with a different encoding? Or is the header simply broken? If I understood this correctly, headers are supposed to be ASCII, right? That is the point of this entire MIME thing, no?
Thanks for your help!
/Workaround
I found a workaround: do not iterate directly over the .mbox mail folder object. Instead of using ...
for message in mflder:
    # do something
... simply use:
for x in range(len(mflder)):
    try:
        message = mflder[x]
        print(header_to_string(message["subject"]))
    except:
        print("Failed loading message!")
This skips the broken messages in the .mbox folder. However, I stumbled upon several other issues while working with the subjects. For instance, decode_header() sometimes splits a header into several tuples, so to get the full subject, header_to_string() has to process every tuple instead of only the first one. But that is no longer related to this question. I am a noob and a hobby programmer, but I remember working with the Outlook API and Python, which was MUCH easier...
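The point about decode_header() returning several tuples can be sketched as follows. This is only a minimal sketch, not the asker's final code: the name header_to_text is hypothetical, and falling back to "latin-1" when the declared charset is unknown or wrong is an assumption about what is acceptable for display purposes.

```python
from email.header import decode_header

def header_to_text(raw_header):
    """Join every fragment returned by decode_header() into one string.

    header_to_text is a hypothetical helper name used for illustration.
    """
    if raw_header is None:
        return ""
    parts = []
    for fragment, charset in decode_header(raw_header):
        if isinstance(fragment, bytes):
            try:
                # Unencoded fragments come back with charset None.
                parts.append(fragment.decode(charset or "ascii"))
            except (LookupError, UnicodeDecodeError):
                # Assumption: a lossy but never-failing fallback is acceptable.
                parts.append(fragment.decode("latin-1"))
        else:
            # Headers without encoded words are returned as plain str.
            parts.append(fragment)
    return "".join(parts)
```

For well-formed headers, str(make_header(decode_header(raw_header))) from email.header should give the same result; the explicit loop just makes the fallback visible.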
Upvotes: 1
Views: 1530
Reputation: 151
It looks like either you have a corrupted "mailfolder" mbox file, or there is a bug in Python's mailbox module that is triggered by something in your file. I can't tell what is going on without the mbox input file or a minimal example input file that reproduces the issue.
You could do some debugging yourself. Each message in the file starts with a "From" line that should look like:
From - Mon Mar 30 18:18:04 2020
From the stack trace you posted, it looks like that line is malformed in one of the messages. Personally, I would use an IDE debugger (PyCharm) to track down the malformed line, but it can also be done with Python's built-in pdb. Wrap your loop like this:
import pdb
try:
    for message in mflder:
        print(header_to_string(message["subject"]))
except:
    pdb.post_mortem()
When you run the code now, it will drop into the debugger when the exception occurs. At that prompt, you can enter l to list the code where the debugger stopped; this should match the last frame of the stack trace you originally posted. Once you are there, two commands will tell you what is going on:
p from_line
will show you the malformed "From" line.
p start
will show you the offset in the file at which the mailbox code thinks the message starts.
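If you would rather not step through the debugger, the same information can probably be found by scanning the file directly. The following is a rough sketch under assumptions: find_bad_from_lines is a hypothetical helper, and it treats every line that starts with "From " as a message separator, which is a simplification of the mbox format.

```python
def find_bad_from_lines(path):
    """Return (byte_offset, raw_line) for each "From " line that is not pure ASCII.

    find_bad_from_lines is a hypothetical helper name used for illustration.
    """
    bad = []
    offset = 0
    # Read in binary mode so that no decoding happens implicitly.
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b"From "):
                try:
                    line.decode("ascii")
                except UnicodeDecodeError:
                    bad.append((offset, line.rstrip(b"\r\n")))
            offset += len(line)
    return bad
```

Calling find_bad_from_lines("mailfolder") should list the separator lines that would make the ASCII decoding step in the traceback fail, together with their byte offsets in the file.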
In the real world, there will be messages that don't comply with the standards. If you don't want to reject the bad messages, you can try to make the code more tolerant. Decoding with "latin-1" is one way to handle headers that contain bytes outside ASCII. This cannot fail, because all possible byte values map to valid Unicode characters (ISO/IEC 8859-1, a.k.a. "latin-1", maps one-to-one onto the first 256 code points of Unicode). This may or may not give you the text the sender intended.
import mailbox
from email.header import decode_header

mflder = mailbox.mbox("mailfolder")

def get_subject(message):
    header = message["subject"]
    if not header:
        return ''
    # Only the first decoded fragment of the header is used here.
    header, encoding = decode_header(header)[0]
    if encoding is not None:
        try:
            header = header.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown or wrong charset: latin-1 accepts every byte value.
            header = header.decode('latin-1')
    return header

for message in mflder:
    print(get_subject(message))
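To illustrate why the "latin-1" fallback cannot fail but may still give the wrong text, here is a small sketch. The sample bytes are made up, and "cp1252" is only a guess at where a byte like 0x93 might have come from, not something the traceback confirms.

```python
# Every possible byte value decodes under latin-1 and round-trips unchanged.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
round_trip_ok = as_text.encode("latin-1") == all_bytes

# Made-up sample containing the byte 0x93 from the error message.
raw = b"\x93quoted\x94"
as_latin1 = raw.decode("latin-1")  # C1 control characters U+0093 and U+0094
as_cp1252 = raw.decode("cp1252")   # curly quotes U+201C and U+201D

print(round_trip_ok)
print(repr(as_latin1))
print(repr(as_cp1252))
```

Both decodings succeed, but only one of them can be what the sender meant, so the fallback is best treated as a way to keep going rather than as a correct decoding.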
Upvotes: 2