kan-don

Reputation: 23

Why can't I load this json file in python?

Solution, if someone finds this when googling it:
The problem was not with the code per se, but with the download in Firefox. Apparently (see https://bugzilla.mozilla.org/show_bug.cgi?id=1470011) some servers gzip files twice. The downloaded file should then be called file.json.gz.gz, but it is saved with one .gz missing. It needs to be extracted twice to get to the content.
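For anyone who wants to handle this in Python rather than extracting by hand, here is a minimal sketch that keeps decompressing while the data still starts with the gzip magic bytes (0x1f 0x8b). The names gunzip_all_layers, load_json_from_gzip and the max_layers cap are my own for illustration, not part of any library. The check is safe for JSON because a JSON document never starts with the byte 0x1f.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def gunzip_all_layers(data, max_layers=5):
    """Decompress repeatedly while the payload still looks like gzip.

    max_layers is a hypothetical safety cap so a malformed file cannot
    keep the loop running indefinitely.
    """
    for _ in range(max_layers):
        if not data.startswith(GZIP_MAGIC):
            break
        data = gzip.decompress(data)
    return data


def load_json_from_gzip(path):
    """Read a (possibly double-gzipped) file and parse it as JSON."""
    with open(path, "rb") as fh:
        raw = fh.read()
    # json.loads accepts UTF-8 encoded bytes directly.
    return json.loads(gunzip_all_layers(raw))
```

With this, load_json_from_gzip('vndb-tags-2020-12-31.json.gz') should work whether the file was compressed once or twice, assuming the whole file fits in memory.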


I am trying to sort through some information in this file: https://dl.vndb.org/dump/vndb-tags-latest.json.gz. I am very new to working with JSON, and I can't find anything that helps me.

The problem is that I can't get it to load into Python. Extracting the .gz file with 7zip and trying to load the file with json.load(open('vndb-tags-2020-12-31.json', encoding='utf-8')) returns the error

>>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte.

Without the utf-8 argument I get

>>> UnicodeDecodeError: 'cp932' codec can't decode byte 0x8b in position 1: illegal multibyte sequence

instead. I run into the same problem when I try to decompress the file on the fly using the gzip package

import gzip
import json
with gzip.open('vndb-tags-2020-12-31.json.gz') as fd:
    json.load(fd)
>>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I am guessing that I need a different encoding option, but utf-16 and utf-32 don't work, and I can't find anything on the help page https://vndb.org/d14
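In hindsight, the byte 0x8b at position 1 was the clue: 0x1f 0x8b are the two magic bytes that start every gzip stream, so the "extracted" data was still gzip, not text in some other encoding. A small sketch that reproduces the error with made-up data (the payload and the helper name first_bad_byte are only for illustration):

```python
import gzip


def first_bad_byte(data):
    """Return (position, byte value) of the first byte that is not valid
    UTF-8, or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# Simulate a server that gzips the file twice, then extract it only once.
payload = '{"name": "example"}'.encode("utf-8")
double_gzipped = gzip.compress(gzip.compress(payload))
extracted_once = gzip.decompress(double_gzipped)

# The result is still a gzip stream, so it starts with the magic bytes.
print(extracted_once[:2].hex())        # 1f8b
print(first_bad_byte(extracted_once))  # (1, 139), and 139 == 0x8b
```

0x1f happens to be a valid (control) character in UTF-8, which is why the decoder only complains at position 1.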

Upvotes: 1

Views: 1198

Answers (1)

DRPK

Reputation: 2091

You can download, extract and load your data as JSON step by step:

  1. Send a request to the target URL and fetch the response body as a bytes object.
  2. Wrap those bytes in an in-memory file object with the io module.
  3. Pass the file object to gzip.open and read the decompressed JSON data.
  4. Pass the JSON data to json.loads to parse it into a Python object (a dict or list).

Try this:

import requests, io, gzip, json


url = 'https://dl.vndb.org/dump/vndb-tags-latest.json.gz'
file_object = io.BytesIO(requests.get(url).content)

with gzip.open(file_object, 'r') as gzip_file:
    reserve_data = gzip_file.read()

load_json = json.loads(reserve_data)
beautiful_json = json.dumps(load_json, sort_keys=True, indent=4)
print(beautiful_json)

For larger files it's better to save the gzip file to disk first, then load it from disk:

import requests, gzip, json

target_url = 'https://dl.vndb.org/dump/vndb-tags-latest.json.gz'
downloaded_gzip_file = requests.get(target_url).content

with open("my_json_file.gz", "wb") as gz_file:
    gz_file.write(downloaded_gzip_file)

with gzip.open("my_json_file.gz") as gz_file:
    load_json_data = json.load(gz_file)

beautiful_json = json.dumps(load_json_data, sort_keys=True, indent=4)
print(beautiful_json)

Upvotes: 2
