Python: Problems with latin characters in output

Question

I have a document in Spanish that I'd like to format using Python. The problem is that in the output file the accented characters are garbled, like this: \xc3\xad. I managed to keep the proper characters when I did some similar editing a while back, and although I've tried everything I did then and more, it won't work this time. This is the current version of the code:

# -*- coding: utf-8 -*- 

import re
import pickle

inputfile = open("input.txt").read()

pat = re.compile(r"(@.*\*)")

mylist = pat.findall(inputfile)

outputfile = open("output.txt", "w")

pickle.dump(mylist, outputfile)

outputfile.close()

I'm using Python 2.7 on Windows 7. Can anyone see any obvious problems? The input file is encoded in UTF-8, but I've tried encoding it as Latin-1 too. Thanks.

To clarify: my problem is that the Latin characters don't show up properly in the output. It's solved now; I just had to add this line, as suggested by mata:

inputfile = inputfile.decode('utf-8')

mata · Accepted Answer

If the input file is encoded in UTF-8, then you should decode it first to work with it:

import re
import pickle

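# In Python 2, read() returns a byte string (the raw UTF-8 bytes).
inputfile = open("input.txt").read()
# Decode the bytes into a unicode object before matching against it.
inputfile = inputfile.decode('utf-8')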

pat = re.compile(r"(@.*\*)")

mylist = pat.findall(inputfile)

outputfile = open("output.txt", "w")

pickle.dump(mylist, outputfile)

outputfile.close()
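
To illustrate what the decode step changes, here is a minimal sketch that is not part of the original code. The sample word in `raw` is made up for illustration, and the `b""` / `u""` prefixes are only there so that the snippet behaves the same on Python 2.7 and Python 3.3+.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: the UTF-8 bytes for the Spanish word "país".
raw = b"pa\xc3\xads"

# Decoding turns the two bytes \xc3\xad into the single character í.
text = raw.decode("utf-8")

print(len(raw))   # 5 bytes
print(len(text))  # 4 characters
print(text == u"pa\xeds")  # True: \xed is the code point of í
```

The \xc3\xad seen in the question is the undecoded two-byte form of í.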

The output.txt written by pickle.dump will contain a pickled version of your list. If you would rather have a human-readable file, you might want to just write to a plain file.
Also, a good way to deal with different encodings is to use the codecs module:

import re
import codecs

with codecs.open("input.txt", "r", "utf-8") as infile:
    inp = infile.read()

pat = re.compile(r"(@.*\*)")
mylist = pat.findall(inp)

with codecs.open("output.txt", "w", "utf-8") as outfile:
    outfile.write("\n".join(mylist))
