lazarea

Reputation: 1349

How can I recode mbox files to UTF-8 in Python?

I have exported a bunch of Gmail messages and would like to parse them and extract insights using Python. However, after exporting I noticed a strange encoding in the mbox files: for example, the character 'é' appears as =E9, and the quote symbols (“ and ”) appear as =E2=80=9C and =E2=80=9D. My emails often contain a lot of foreign script, so it is very important for me to decode these files into UTF-8. I also often have messages with emojis, which convey important sentiment information that I need to preserve.

I found out that this encoding is called quoted-printable, and I tried using Python's quopri module, but without success.
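
To illustrate what the encoding looks like, here is a minimal sketch using quopri.decodestring on the two sequences mentioned above. The sample byte strings are made up, and the choice of Latin-1 for =E9 is an assumption; any single-byte charset that maps 0xE9 to 'é' would behave the same.

```python
import quopri

# quopri only undoes the =XX escaping and returns raw bytes;
# turning those bytes into text still requires the right charset.
latin1_bytes = quopri.decodestring(b"caf=E9")
utf8_bytes = quopri.decodestring(b"=E2=80=9Cquoted=E2=80=9D")

# =E9 is a single byte (assumed Latin-1 here), not valid UTF-8 on its own.
print(latin1_bytes.decode("latin-1"))
# =E2=80=9C and =E2=80=9D are three-byte UTF-8 sequences.
print(utf8_bytes.decode("utf-8"))
```

So the same file can mix charsets, and undoing quoted-printable alone does not necessarily yield UTF-8.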

Here is my simplified code:

import os
import quopri
from pathlib import Path

for filename in os.listdir(directory):
    if filename.endswith(".mbox"): 
        input_filename =  Path(os.path.join(directory,filename))
        output_filename = Path(os.path.join(directory,filename+'_utf-8'))

        with open(input_filename, 'rb'):
            quopri.decode(input_filename, output_filename)

However, when running this, I get the following error at the last line: AttributeError: 'WindowsPath' object has no attribute 'read'. I don't understand why this error appears, as the path defined points to the file.

Upvotes: 1

Views: 398

Answers (1)

snakecharmerb

Reputation: 55894

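quopri.decode expects open binary file objects, not paths, which is why passing a Path raises the 'read' AttributeError. Open both files and bind them to names in the with statement, like this: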

with input_filename.open('rb') as infile, output_filename.open('wb') as outfile:
    quopri.decode(infile, outfile)
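
Note that running quopri over a whole mbox file also touches headers and parts that use other transfer encodings, and the decoded bytes stay in whatever charset each message declared. A possible alternative, sketched here under assumptions, is to let the standard mailbox and email modules decode each text part. The helper name iter_text_parts and the UTF-8 fallback for parts without a declared charset are illustrative choices, not something from the question.

```python
import mailbox


def iter_text_parts(mbox_path):
    """Yield the decoded text of every text/* part in an mbox file."""
    # create=False: do not create an empty file if the path is wrong
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            for part in message.walk():
                if part.get_content_maintype() != "text":
                    continue
                # decode=True undoes quoted-printable or base64, returns bytes
                payload = part.get_payload(decode=True)
                if payload is None:
                    continue
                # Assumption: fall back to UTF-8 if no charset is declared
                charset = part.get_content_charset() or "utf-8"
                try:
                    yield payload.decode(charset, errors="replace")
                except LookupError:
                    # The declared charset name is unknown to Python
                    yield payload.decode("utf-8", errors="replace")
    finally:
        box.close()
```

The yielded values are ordinary Python strings, so writing them to a file opened with encoding='utf-8' should preserve accented characters and emojis.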

Upvotes: 1
