Reputation: 1357
RESOLVED: The problem had to do with the Python version; refer to stackoverflow.com/a/5513856/2540382
I am fiddling with htm -> txt file conversion and am having a little trouble. My project is essentially to convert the messages.htm file I downloaded of my Facebook chat history into a messages.txt file with all the <> brackets removed and formatting preserved.
The file messages.htm is parsed into a variable named text.
I then run:
target = open('output.txt', 'w')
target.write(text)
target.close()
This seems to work except when I hit an invalid character, as seen in the error below. Is there a way to either:
Skip the line with the invalid character while writing?
Figure out where the invalid characters are and remove the corresponding character or line?
The desired outcome is to avoid having strange characters altogether, if possible.
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U000fe333' in position 37524: character maps to <undefined>
Upvotes: 2
Views: 4309
Reputation: 922
target = open('output.txt', 'wb')
target.write(text.encode('ascii', 'ignore'))
target.close()
For the "errors" argument to .encode(..), 'ignore' will strip out those characters, and 'replace' will replace them with '?'.
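As a quick side-by-side of the two handlers, here is a minimal sketch. The sample string is made up for the demonstration; it is not taken from the actual messages.htm.

```python
# Sample string containing one character that ASCII cannot represent.
sample = u"foo\U000fe333bar"

# 'ignore' silently drops anything the codec cannot encode.
ignored = sample.encode("ascii", "ignore")

# 'replace' substitutes a '?' for anything the codec cannot encode.
replaced = sample.encode("ascii", "replace")

print(ignored)
print(replaced)
```

Which one to pick depends on whether you want a visible marker where a character was removed.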
To test this, I replaced the write line with
target.write(u"foo\U000fe333bar".encode("ascii", "ignore"))
and confirmed that output.txt contained only "foobar".
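The question also asks how to figure out where the invalid characters are. One possible approach is to try encoding each character on its own and record the ones that fail. The helper name find_unencodable is hypothetical, and "ascii" is only an assumed default; the original traceback came from a 'charmap' codec, so you may want to pass that codec's name instead.

```python
def find_unencodable(text, encoding="ascii"):
    """Return (position, character) pairs that `encoding` cannot represent."""
    bad = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            # Remember where the problem character sits and what it is.
            bad.append((index, char))
    return bad


# Made-up sample; in practice this would be the parsed messages.htm content.
sample = u"foo\U000fe333bar"
print(find_unencodable(sample))
```

With the positions in hand you can decide whether to drop single characters or whole lines.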
UPDATE: I edited the open(.., 'w') to open(.., 'wb') to make sure this would work in Python 3 as well.
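An alternative that should avoid the manual .encode(..) step is to open the file in text mode and let the file object apply the error handler. This is a sketch, not something tested against the original data: the text value below is a stand-in, and io.open is used because it accepts encoding and errors on both Python 2.6+ and Python 3.

```python
import io

# Stand-in for the text parsed out of messages.htm.
text = u"foo\U000fe333bar"

# The file object encodes on write; errors="ignore" drops characters
# that ASCII cannot represent. The with block also closes the file.
with io.open("output.txt", "w", encoding="ascii", errors="ignore") as target:
    target.write(text)
```

If you would rather keep the unusual characters than strip them, passing encoding="utf-8" instead should write them without raising.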
Upvotes: 3