Remove non-UTF8 characters from file contents

Question

I'm trying to read usernames from a database, and if there are any non-UTF-8 characters, it throws a UnicodeDecodeError.

I don't know which characters count as non-UTF-8, so I'm looking for a general solution.

I want to keep special symbols and filter out only the ones that aren't compatible with UTF-8. ³ and ™ (trademark) don't seem to work with UTF-8; they're the only two I know of.

I still want to keep Chinese characters, Arabic, etc. That's why I'm using UTF-8.

Code:

def is_author_used(author):
    # Raw string so the backslashes in the Windows path are not treated as escapes.
    with open(r"C:\Users\Administrator\Desktop\authors.txt", 'r', encoding='utf-8') as f:
        content = f.read().splitlines()
    return author in content

def set_author_used(author):
    # Append the author on its own line.
    with open(r"C:\Users\Administrator\Desktop\authors.txt", 'a', encoding='utf-8') as f:
        f.write(author + '\n')

Danil Speransky · Accepted Answer

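Maybe something like this, which passes errors='ignore' so that bytes that can't be decoded as UTF-8 are skipped instead of raising an exception: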

with open('text.txt', encoding='utf-8', errors='ignore') as f:
    content = f.read().splitlines()
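
To see what errors='ignore' actually does, here is a minimal sketch. The sample names and the temporary file are made up for illustration, and it assumes the offending byte is 0x99, which is how Windows-1252 stores ™. It compares errors='ignore' with errors='replace', a second standard error handler that marks the bad bytes instead of dropping them:

```python
import os
import tempfile

# Hypothetical sample data: two valid UTF-8 names, plus one line ending in
# the byte 0x99 ("™" in Windows-1252), which is not valid UTF-8 on its own.
raw = "alice\n李雷\n".encode("utf-8") + b"Acme\x99\n"

# Write the bytes to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # errors='ignore' silently drops bytes that cannot be decoded.
    with open(path, encoding="utf-8", errors="ignore") as f:
        ignored = f.read().splitlines()

    # errors='replace' substitutes U+FFFD so the damaged spot stays visible.
    with open(path, encoding="utf-8", errors="replace") as f:
        replaced = f.read().splitlines()
finally:
    os.remove(path)

print(ignored)
print(replaced)
```

The Chinese name survives both modes, because only undecodable bytes are affected. Note that ³ (U+00B3) and ™ (U+2122) are both representable in UTF-8, so a UnicodeDecodeError on them likely means the file was saved in a different encoding such as Windows-1252. If that is the case, opening the file with that encoding would keep the symbols rather than discard them.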
