Reputation: 783
I have a document in Spanish that I'd like to format using Python. The problem is that in the output file the accented characters are messed up, like this: \xc3\xad. I managed to keep the proper characters when I did some similar editing a while back, and although I've tried everything I did then and more, it won't work this time. This is the current version of the code:
# -*- coding: utf-8 -*-
import re
import pickle
inputfile = open("input.txt").read()
pat = re.compile(r"(@.*\*)")
mylist = pat.findall(inputfile)
outputfile = open("output.txt", "w")
pickle.dump(mylist, outputfile)
outputfile.close()
I'm using Python 2.7 on Windows 7. Can anyone see any obvious problems? The input file is encoded in utf-8, but I've tried encoding it as latin-1 too. Thanks.
To clarify: my problem is that the accented characters don't show up properly in the output. It's solved now; I just had to add this line, as suggested by mata:
inputfile = inputfile.decode('utf-8')
Upvotes: 0
Views: 814
Reputation: 69042
If the input file is encoded in utf-8, then you should decode it first to work with it:
import re
import pickle
inputfile = open("input.txt").read()
inputfile = inputfile.decode('utf-8')
pat = re.compile(r"(@.*\*)")
mylist = pat.findall(inputfile)
outputfile = open("output.txt", "w")
pickle.dump(mylist, outputfile)
outputfile.close()
The file created this way will contain a pickled version of your list. If you would rather have a human-readable file, you might want to just write a plain text file instead.
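To see why the decode step matters, here is a minimal sketch. The sample word is made up for illustration and is not from the original input file. The sequence \xc3\xad is simply the two-byte utf-8 encoding of the single character í, so until the data is decoded Python treats it as two separate bytes.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample data: the bytes of the word "aquí" as stored in a utf-8 file.
raw = b"aqu\xc3\xad"

# Decoding turns the bytes into text (unicode in Python 2, str in Python 3).
text = raw.decode("utf-8")

# The accented character occupies two bytes but is one character of text.
byte_count = len(raw)    # counts bytes
char_count = len(text)   # counts characters

# Encoding the text again gives back the original bytes.
round_trip = text.encode("utf-8")

print(byte_count, char_count)
```

Working on the decoded text and encoding only when writing keeps the accented characters intact.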
Another good way to deal with different encodings is to use the codecs module:
import re
import codecs
with codecs.open("input.txt", "r", "utf-8") as infile:
    inp = infile.read()
pat = re.compile(r"(@.*\*)")
mylist = pat.findall(inp)
with codecs.open("output.txt", "w", "utf-8") as outfile:
    outfile.write("\n".join(mylist))
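As a possible variation on the same idea, io.open (available since Python 2.6, and the same function as the built-in open in Python 3) also accepts an encoding argument. The sketch below is only an illustration: the helper name extract_matches, the temporary file paths and the sample text are assumptions, not part of the original question.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile

PATTERN = re.compile(r"(@.*\*)")  # same pattern as in the question


def extract_matches(in_path, out_path):
    """Hypothetical helper: read utf-8 text, collect matches, write them as utf-8."""
    with io.open(in_path, "r", encoding="utf-8") as infile:
        text = infile.read()           # already decoded text
    matches = PATTERN.findall(text)
    with io.open(out_path, "w", encoding="utf-8") as outfile:
        outfile.write(u"\n".join(matches))
    return matches


# Made-up sample input, written to a temporary directory for the demo.
tmpdir = tempfile.mkdtemp()
in_path = os.path.join(tmpdir, "input.txt")
out_path = os.path.join(tmpdir, "output.txt")
with io.open(in_path, "w", encoding="utf-8") as f:
    f.write(u"@d\u00eda*\nignored line\n@a\u00f1o*\n")

result = extract_matches(in_path, out_path)

# Read the output back both as text and as raw bytes to inspect it.
with io.open(out_path, "r", encoding="utf-8") as f:
    written_text = f.read()
with io.open(out_path, "rb") as f:
    written_bytes = f.read()
```

Here the output file holds the matches one per line as ordinary utf-8 text, so a text editor set to utf-8 should show the accents correctly.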
Upvotes: 2