Reputation: 23
Solution, in case someone finds this while googling:
The problem was not with the code per se, but with the download in Firefox. Apparently (see https://bugzilla.mozilla.org/show_bug.cgi?id=1470011) some servers gzip files twice. The downloaded file should then be called file.json.gz.gz, but one .gz is missing from the name. It needs to be extracted twice to get to the content.
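A minimal sketch of that double extraction in Python, assuming the whole file fits in memory. The helper name gunzip_all and the max_layers limit are made up for illustration; only gzip.decompress and the two-byte gzip signature come from the standard library and the gzip format.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def gunzip_all(data: bytes, max_layers: int = 5) -> bytes:
    """Strip gzip layers until the payload no longer looks like gzip.

    Hypothetical helper; max_layers is an arbitrary safety limit.
    """
    layers = 0
    while data.startswith(GZIP_MAGIC):
        if layers == max_layers:
            raise ValueError("too many nested gzip layers")
        data = gzip.decompress(data)
        layers += 1
    return data


# Self-contained demo: compress a small JSON document twice, then recover it.
original = {"name": "example", "id": 1}
twice = gzip.compress(gzip.compress(json.dumps(original).encode("utf-8")))
recovered = json.loads(gunzip_all(twice))
assert recovered == original
```

For the downloaded dump, the same helper could be fed the bytes read from the file opened in 'rb' mode, and the result passed to json.loads.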
I am trying to sort through some information in this file: https://dl.vndb.org/dump/vndb-tags-latest.json.gz. I am very new to working with JSON, and I can't find anything that helps me.
The problem is that I can't get the file to load into Python. Extracting the .gz file with 7zip and trying to load the result with json.load(open('vndb-tags-2020-12-31.json', encoding='utf-8'))
returns the error
>>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Without the utf-8 argument I get
>>> UnicodeDecodeError: 'cp932' codec can't decode byte 0x8b in position 1: illegal multibyte sequence
instead. I run into the same problem when I try to decompress the file on the fly using the gzip package:
import gzip
import json
with gzip.open('vndb-tags-2020-12-31.json.gz') as fd:
    json.load(fd)
>>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I am guessing that I need a different encoding option, but utf-16 and utf-32 don't work, and I can't find anything on the help page https://vndb.org/d14
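One way to narrow this down: the byte 0x8b at position 1 in the errors above is the second byte of the gzip signature (0x1f 0x8b), which suggests the data is still compressed rather than wrongly encoded. A rough sketch of that check follows; the function name looks_gzipped and the sample file are made up for the demo.

```python
import gzip
import os
import tempfile


def looks_gzipped(path: str) -> bool:
    """Return True if the file starts with the gzip signature 0x1f 0x8b."""
    with open(path, "rb") as fh:
        return fh.read(2) == b"\x1f\x8b"


# Demo with a throwaway file: a ".json" file that is in fact still gzipped.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "wb") as fh:
        fh.write(gzip.compress(b'{"ok": true}'))
    still_compressed = looks_gzipped(path)
```

If the check comes back True for the file that 7zip extracted, the file needs another round of decompression before json.load can read it.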
Upvotes: 1
Views: 1198
Reputation: 2091
You can download, decompress and load your data as JSON step by step.
Try this:
import requests, io, gzip, json
url = 'https://dl.vndb.org/dump/vndb-tags-latest.json.gz'
# Download the archive and keep it in memory as a file-like object
file_object = io.BytesIO(requests.get(url).content)
# Decompress the in-memory archive
with gzip.open(file_object, 'r') as gzip_file:
    reserve_data = gzip_file.read()
# Parse the decompressed bytes and pretty-print the result
load_json = json.loads(reserve_data)
beautiful_json = json.dumps(load_json, sort_keys=True, indent=4)
print(beautiful_json)
For larger files, it's better to save the gzip file to disk first, then load it from disk:
import requests, gzip, json
target_url = 'https://dl.vndb.org/dump/vndb-tags-latest.json.gz'
downloaded_gzip_file = requests.get(target_url).content
# Write the downloaded archive to disk
with open("my_json_file.gz", "wb") as gz_file:
    gz_file.write(downloaded_gzip_file)
# Reopen it through gzip so json.load reads the decompressed data
with gzip.open("my_json_file.gz") as gz_file:
    load_json_data = json.load(gz_file)
beautiful_json = json.dumps(load_json_data, sort_keys=True, indent=4)
print(beautiful_json)
Upvotes: 2