Python Write to Text File Skip Bad Lines

Question

RESOLVED: Problem had to do with Python version, refer to stackoverflow.com/a/5513856/2540382

I am fiddling with htm -> txt file conversion and am having a little trouble. My project is essentially to convert the messages.htm file I downloaded of my Facebook chat history into a messages.txt file with all the <> brackets removed and formatting preserved.

The file messages.htm is parsed into variable text.

I then run:

target = open('output.txt', 'w')
target.write(text)
target.close()

This seems to work except when I hit an invalid character, as seen in the error below. Is there a way to either:

Skip the line with the invalid character while writing?
Figure out where the invalid characters are and remove the corresponding character or line?

The desired outcome is to avoid having strange characters altogether, if possible.

return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U000fe333' in position 37524: character maps to <undefined>

Ken Geis · Accepted Answer

target = open('output.txt', 'wb')
target.write(text.encode('ascii', 'ignore'))
target.close()

For the "errors" argument to .encode(..), 'ignore' will strip out those characters, and 'replace' will replace them with '?'.

To test this, I replaced the write line with

target.write(u"foo\U000fe333bar".encode("ascii", "ignore"))

and confirmed that output.txt contained only "foobar".
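
The same check can be extended to compare the two error handlers side by side. This is only a minimal sketch, assuming Python 3; the name sample is made up for illustration and reuses the character from the traceback.

```python
# Hypothetical sample string containing the character from the traceback.
sample = u"foo\U000fe333bar"

# 'ignore' drops any character that ASCII cannot represent.
ignored = sample.encode("ascii", "ignore")

# 'replace' substitutes a '?' for each such character.
replaced = sample.encode("ascii", "replace")

print(ignored)
print(replaced)
```

Both calls return bytes objects in Python 3, which is why the file has to be opened in binary mode before writing them.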

UPDATE: I changed open(.., 'w') to open(.., 'wb') so that this also works in Python 3, where .encode(..) returns bytes and a file opened in text mode will not accept them.
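
If only Python 3 needs to be supported, a possible alternative is to keep the file in text mode and let open() apply the error handler itself. The sketch below assumes Python 3 (Python 2's built-in open() does not take these arguments); the helper name write_ascii_only and the temporary output path are hypothetical, chosen just for this example.

```python
import os
import tempfile


def write_ascii_only(text, path):
    """Write text to path, dropping characters that ASCII cannot represent."""
    # errors="ignore" makes the file object skip unencodable characters
    # instead of raising UnicodeEncodeError.
    with open(path, "w", encoding="ascii", errors="ignore") as target:
        target.write(text)


# Write a string containing the problem character to a temporary location.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_ascii_only("foo\U000fe333bar\n", out_path)

# Read the file back to see what was actually written.
with open(out_path, "r", encoding="ascii") as source:
    result = source.read()

print(repr(result))
```

Passing errors="replace" instead would write a '?' in place of each dropped character. The with block also closes the file automatically, so no explicit close() call is needed.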
