Reputation: 1591
I scraped some text from a Taiwanese website, stripped the HTML, and kept only what I needed as txt files. The content of the txt files displays correctly in Firefox/Chrome. With Python 3, if I do f = open(text_file).read()
I get this error:
'utf-8' codec can't decode byte 0xa1 in position 29: invalid start byte
ETA: I use Ubuntu, so I'm happy with any solution in Python or in the terminal!
And if I do f = codecs.open(os.path.join(path, 'my_text.txt'), 'r', encoding='Big5') and then read(), I get this message:
'big5' codec can't decode byte 0xf9 in position 1724: illegal multibyte sequence
I only need the Chinese characters. How can I keep only those encoded as Big5? That would get rid of the error, yes?
Upvotes: 0
Views: 1730
Reputation: 414665
The built-in open() function has an errors parameter:
with open(filename, encoding='utf-8', errors='replace') as file:
    text = file.read()
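To see what the errors parameter does, here is a small sketch on in-memory bytes rather than a file. The byte string is made up for illustration; it contains a stray 0xa1 byte like the one in the question's error message.

```python
# Illustrative bytes (not from the asker's file): ASCII text with a stray 0xa1.
raw = b"abc\xa1def"

# Strict decoding (the default) raises, as in the question.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = str(exc)

# errors="replace" substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" silently drops undecodable bytes.
ignored = raw.decode("utf-8", errors="ignore")

print(strict_error)
print(repr(replaced))
print(repr(ignored))
```

Both modes hide the error but lose data, so they are best used once you are reasonably sure the chosen encoding is the right one for most of the file.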
It is possible that your file uses some other character encoding or even (if the code that saves the text is buggy) a mixture of several character encodings.
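If the encoding is unknown, one option is to try a few candidates strictly and only fall back to a lossy decode when none of them fits. The following is a minimal sketch, not a definitive detector: the helper name read_text_guessing and the candidate list are assumptions chosen for Traditional Chinese text, and a successful decode does not prove the guess is right.

```python
from pathlib import Path

# Assumed candidates: codecs commonly used for Traditional Chinese text.
CANDIDATES = ("utf-8", "big5", "cp950", "big5hkscs")


def read_text_guessing(path, candidates=CANDIDATES):
    """Return (text, encoding) for the first candidate that decodes the whole file.

    Hypothetical helper. If no candidate decodes cleanly (for example, the file
    mixes encodings), decode with the first candidate and errors="replace",
    and return None as the encoding.
    """
    data = Path(path).read_bytes()
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return data.decode(candidates[0], errors="replace"), None
```

Usage would look like text, enc = read_text_guessing('my_text.txt'). When enc comes back as None, the text contains U+FFFD replacement characters and the file is worth inspecting by hand.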
You can see which encoding your browser uses, e.g. in Chrome: "More tools -> Encoding".
Upvotes: 2