Joseph Jones

Reputation: 187

Remove non-UTF8 characters from file contents

I'm trying to read usernames from a database, and if they contain non-UTF-8 characters, Python throws a UnicodeDecodeError.

I don't know which characters are invalid in UTF-8, so I'm looking for a general solution.

I want to keep special symbols and filter out only the ones that aren't compatible with UTF-8. ³ and the trademark symbol don't work with UTF-8; they're the only two I know of.

I still want to keep Chinese characters, Arabic, etc. That's why I'm using UTF-8.

Code:

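def is_author_used(author):
    # Return True if the author already appears on its own line in the file.
    with open("C:\\Users\\Administrator\\Desktop\\authors.txt", 'r', encoding='utf-8') as f:
        content = f.read().splitlines()
    return author in content

def set_author_used(author):
    # Append the author on its own line. Text mode already translates '\n'
    # to the platform line ending, so writing '\r\n' would yield '\r\r\n' on Windows.
    with open("C:\\Users\\Administrator\\Desktop\\authors.txt", 'a', encoding='utf-8') as f:
        f.write(author + '\n')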

Upvotes: 1

Views: 6919

Answers (1)

Danil Speransky

Reputation: 30453

Maybe something like this, which silently drops any bytes that can't be decoded as UTF-8:

with open('text.txt', encoding='utf-8', errors='ignore') as f:
    content = f.read().splitlines()
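
To see what `errors='ignore'` actually does, here is a minimal sketch that decodes a few bytes directly instead of reading a file. The sample bytes are made up for illustration, and the guess that the file was written in Windows-1252 (cp1252) is an assumption, not something the question confirms. Note that ³ and the trademark sign are themselves representable in UTF-8; decoding fails only when they are stored as single bytes from a legacy encoding.

```python
# Hypothetical bytes as a Windows-1252 file might store them:
# 0xB3 is the superscript three and 0x99 is the trademark sign in cp1252.
raw = b"user\xb3\nbrand\x99\n"

# Strict decoding (the default) is what raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='ignore' drops the undecodable bytes without any warning.
ignored = raw.decode("utf-8", errors="ignore").splitlines()

# errors='replace' substitutes U+FFFD so the data loss stays visible.
replaced = raw.decode("utf-8", errors="replace").splitlines()

# If the data really is cp1252 (an assumption), decoding with that
# encoding keeps the symbols instead of discarding them.
as_cp1252 = raw.decode("cp1252").splitlines()

# Text that is already valid UTF-8 (Chinese, Arabic) is left untouched.
valid = "作者 مؤلف".encode("utf-8")
kept = valid.decode("utf-8", errors="ignore")

print(strict_failed, ignored, replaced, as_cp1252, kept)
```

The same `errors` and `encoding` arguments can be passed to `open()`. With `'ignore'`, a name such as `user³` would be read back as `user`, so the membership check may then match a different author than intended.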

Upvotes: 3
