Reputation: 1349
I have exported a bunch of Gmail messages and would like to parse them and get insights using Python. However, after exporting I noticed an odd encoding in these mbox files: for example, the character 'é' appears as =E9, and the quote symbols (“ and ”) appear as =E2=80=9C and =E2=80=9D. My emails often contain a lot of foreign script, so it is very important for me to decode these files into UTF-8. I also often have messages with emojis, which convey important sentiment information that I need to preserve.
I found out that this encoding is called Quoted-Printable and tried using Python's quopri module, but without success.
Here is my simplified code:
    import os
    import quopri
    from pathlib import Path

    for filename in os.listdir(directory):
        if filename.endswith(".mbox"):
            input_filename = Path(os.path.join(directory, filename))
            output_filename = Path(os.path.join(directory, filename + '_utf-8'))
            with open(input_filename, 'rb'):
                quopri.decode(input_filename, output_filename)
However, when I run this, I get the following error on the last line: AttributeError: 'WindowsPath' object has no attribute 'read'. I don't understand why this error appears, since the path points to the file.
Upvotes: 1
Views: 398
Reputation: 55894
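You need to bind names to the file objects opened by the context managers (the with statements) and pass those to quopri.decode, which expects binary file objects rather than paths. The output file has to be opened as well, like this: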
    with input_filename.open('rb') as infile, output_filename.open('wb') as outfile:
        quopri.decode(infile, outfile)
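For completeness, here is a minimal sketch of how that fix might slot into the loop from the question. The function name decode_mbox_files and the returned list of output paths are my own additions for illustration and are not part of the original code; the output naming (appending '_utf-8') follows the question.

```python
import quopri
from pathlib import Path


def decode_mbox_files(directory):
    """Quoted-printable-decode every .mbox file in `directory`.

    Hypothetical helper: the name and return value are illustrative only.
    """
    written = []
    for input_path in sorted(Path(directory).glob("*.mbox")):
        # Same output naming as the question: "<name>.mbox_utf-8".
        output_path = input_path.with_name(input_path.name + "_utf-8")
        # quopri.decode() reads from its first argument and writes to its
        # second, so both must be open binary file objects, not Path objects.
        with input_path.open("rb") as infile, output_path.open("wb") as outfile:
            quopri.decode(infile, outfile)
        written.append(output_path)
    return written
```

Note that this only undoes the Quoted-Printable layer and writes the raw bytes as they were. A lone =E9 for 'é' suggests that at least some messages were not UTF-8 to begin with (in UTF-8, 'é' would be =C3=A9), so the resulting file may mix character sets and is not guaranteed to be valid UTF-8 despite its name.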
Upvotes: 1