Roger Steve

Reputation: 21

Getting UnicodeDecodeError loading JSON file containing Chinese

I am trying to load a JSON file whose content is in Chinese, and I am getting a UnicodeDecodeError from the utf-8 codec. Is there a way to use try-except so that I don't lose all of the file's content?

import json

def load_from_json(fin):
    datas = []
    for line in fin:
        data = json.loads(line)
        datas.append(data)
    return datas

Screenshot of the error

Upvotes: 2

Views: 279

Answers (2)

Chipaca

Reputation: 387

It does look like the file might not actually be utf8, so that is indeed a good place to start, as per the other answer. However, to answer your actual question,

Is there any way to use try-except without losing all the content from the file?

yes, there are two ways. One is to set errors="replace" in addition to encoding="utf8" when opening the file. Each undecodable byte then becomes the replacement character U+FFFD (�) and reading continues as before. You then wrap the JSON load in try/except and go from there. This is probably the simplest option, but it is not a very good long-term solution, because the replaced bytes are silently lost.
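As a rough sketch of that first option, assuming the file holds one JSON document per line (as the loop in the question suggests): the function name load_from_json_lenient and the path argument are made up for illustration and are not part of the original code.

```python
import json
import sys

def load_from_json_lenient(path):
    """Hypothetical helper: read one JSON document per line, keeping what can be read."""
    datas = []
    # errors="replace" turns each undecodable byte into U+FFFD instead of raising
    with open(path, encoding="utf8", errors="replace") as fin:
        for i, line in enumerate(fin):
            if not line.strip():
                continue  # skip blank lines
            try:
                datas.append(json.loads(line))
            except json.JSONDecodeError as e:
                # The line decoded, but is not valid JSON: report it and move on
                print(f"line {i}: {e}", file=sys.stderr)
    return datas
```

Any string that contained bad bytes will still load, but with � in place of the original characters, so the damage is hidden rather than fixed.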

A better way would be to open the file in binary mode instead and do the decoding line by line, perhaps something like this:

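import json
import sys

def load_from_json(fin):
    # fin must be opened in binary mode, e.g. open(path, "rb")
    datas = []
    for i, line in enumerate(fin):
        try:
            data = json.loads(line.decode("utf8"))
        except UnicodeDecodeError as e:
            # Report the undecodable line and carry on with the rest
            print(f"line {i}, {line!r}: {e}", file=sys.stderr)
        else:
            datas.append(data)
    return datas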

Upvotes: 0

James McGuigan

Reputation: 8086

This may be a character-encoding issue. There is a library called ftfy ("fixes text for you") which may be able to detect and repair text that was decoded with the wrong encoding.
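A standard-library way to test the encoding hypothesis is to try a few candidate encodings on the raw bytes. This is only a sketch: the candidate list is an assumption (gb18030 and big5 are common for Chinese text), and decode_with_fallback is a hypothetical helper name.

```python
# Assumption: these are plausible encodings for a Chinese-language file
CANDIDATE_ENCODINGS = ("utf-8", "gb18030", "big5")

def decode_with_fallback(raw, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first candidate that decodes raw without error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")
```

A successful decode does not prove the encoding is the right one, so it is worth eyeballing the resulting text before trusting it.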

Upvotes: 1
