How to only keep Big5 characters in a text file

Question

I scraped some text from a Taiwanese website, stripped the HTML, and kept only what I needed as txt files. The content of the txt files displays correctly in Firefox/Chrome. With Python 3, if I do f = open(text_file).read() I get this error:

'utf-8' codec can't decode byte 0xa1 in position 29: invalid start byte

ETA: I use Ubuntu, so I'm happy with any solution in Python or in the terminal!

And if I do f = codecs.open(os.path.join(path, 'my_text.txt'), 'r', encoding='Big5') and then read() I get this message:

'big5' codec can't decode byte 0xf9 in position 1724: illegal multibyte sequence

I only need the Chinese characters. How can I keep only those encoded as Big5? This would get rid of the error, yes?

jfs · Accepted Answer

The built-in open() function has an errors parameter:

with open(filename, encoding='utf-8', errors='replace') as file:
    text = file.read()
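
Applied to the question's case, the same parameter can be combined with encoding='big5'. The following is a minimal sketch, not a definitive fix: read_lossy is a hypothetical helper name, and the sample bytes are made up for the demonstration. With errors='ignore' undecodable bytes are dropped; with errors='replace' each one becomes U+FFFD. Note that skipping a byte in a multibyte encoding may shift the alignment of the bytes that follow, so the result should be checked by eye.

```python
import os
import tempfile


def read_lossy(path, encoding="big5", errors="ignore"):
    """Read a text file, dropping (or replacing) bytes that cannot be decoded.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """
    with open(path, encoding=encoding, errors=errors) as file:
        return file.read()


# Demo data (made up): valid Big5 text with one stray byte (0xff) in the middle.
raw = "中文".encode("big5") + b"\xff" + "字".encode("big5")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)

demo_text = read_lossy(tmp.name)                        # stray byte dropped
demo_marked = read_lossy(tmp.name, errors="replace")    # stray byte -> U+FFFD
os.remove(tmp.name)

print(demo_text)
print(demo_marked)
```

Using errors='replace' first can be the safer choice, since the U+FFFD markers show where bytes were lost.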

It is possible that your file uses some other character encoding or even (if the code that saves the text is buggy) a mixture of several character encodings.
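
One way to check for a different encoding is to try a few candidates strictly and see which one decodes the whole file without errors. This is a rough sketch under assumptions: guess_encoding is a hypothetical helper, and the candidate list (utf-8, big5, cp950, big5hkscs, all codecs that ship with Python) is only a guess at what a Taiwanese site might use. A successful decode does not prove the encoding is the right one, and a file with mixed encodings may fail every candidate.

```python
def guess_encoding(raw, candidates=("utf-8", "big5", "cp950", "big5hkscs")):
    """Return the first candidate that decodes `raw` without errors, else None.

    Hypothetical helper: the candidate list is an assumption, not a standard.
    """
    for name in candidates:
        try:
            raw.decode(name)  # strict decoding: raises on any invalid byte
        except UnicodeDecodeError:
            continue
        return name
    return None


# Usage with made-up sample bytes; for a real file, read it in binary mode:
#     with open('my_text.txt', 'rb') as file:
#         raw = file.read()
sample_big5 = "中文".encode("big5")
sample_utf8 = "中文".encode("utf-8")

print(guess_encoding(sample_big5))
print(guess_encoding(sample_utf8))
```

If the file only decodes with a Big5 extension such as cp950 or big5hkscs, that might explain why the plain big5 codec rejected byte 0xf9 in the question, though that is an assumption about the data.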

You can see which encoding your browser uses, e.g., in Chrome: "More tools -> Encoding".
