Duk

Reputation: 25

Saving GIANT JSONP to Database

Hey I got this JSONP from archive.org:

https://archive.org/advancedsearch.php?q=collection%3Ainternetarchivebooks&fl[]=creator&fl[]=format&fl[]=genre&fl[]=language&fl[]=name&fl[]=title&fl[]=type&fl[]=year&sort[]=&sort[]=&sort[]=&rows=100000000&page=1&output=json&callback=callback&save=yes

Here is one that returns only 2 results:

https://archive.org/advancedsearch.php?q=collection%3Ainternetarchivebooks&fl[]=creator&fl[]=format&fl[]=genre&fl[]=language&fl[]=name&fl[]=title&fl[]=type&fl[]=year&sort[]=&sort[]=&sort[]=&rows=100000000&page=1&output=json&callback=callback&save=yes

Is this JSONP? How can I save it?

I want to save it into smaller .json files, or insert it directly into a database without running out of RAM. The full size is 2 GB. I downloaded it, split it to remove the "callback( )" wrapper, put it back together, and tried to cut it into smaller .json files with Python's json.loads(). But it seems to be corrupted at one point.

So my question is: how do I handle this giant JSONP response? Is there a way to stream it directly from the URL into a database?

What would you do? In the end it should show up in my database. My first step was to create smaller JSON files and then process those. Is there an easier way?

I tried this, but it doesn't seem right:

import os
import json
import requests

# specify the URL where the JSONP data is located
url = 'https://archive.org/advancedsearch.php?q=collection%3Ainternetarchivebooks&fl[]=creator&fl[]=format&fl[]=genre&fl[]=language&fl[]=name&fl[]=title&fl[]=type&fl[]=year&sort[]=&sort[]=&sort[]=&rows=100000000&page=1&output=json&callback=callback&save=yes'

# set the size of each chunk
size_of_the_chunk = 2000

# create a new directory to save the smaller JSON files
dir_name = 'data_split'
if not os.path.exists(dir_name):
    os.mkdir(dir_name)

# send a request to the URL and get the JSONP response
response = requests.get(url, stream=True)
jsonp = ''
for chunk in response.iter_content(chunk_size=1024):
    if chunk:
        jsonp += chunk.decode()

# split the JSONP data into smaller JSON lists and save each list to a separate file
count = 0
for start_idx in range(0, len(jsonp), size_of_the_chunk):
    end_idx = start_idx + size_of_the_chunk
    json_str = jsonp[start_idx:end_idx]
    start = json_str.index('(') + 1
    end = json_str.rindex(')')
    data = json.loads(json_str[start:end])
    filename = os.path.join(dir_name, f'{count+1}.json')
    with open(filename, 'w') as f:
        json.dump(data, f, ensure_ascii=False, indent=True)
    count += 1

print(f'Successfully split {count * size_of_the_chunk} records into {count} files.')
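The corruption in the attempt above most likely comes from cutting the JSONP string at arbitrary character offsets: slicing every 2000 characters splits JSON objects (and possibly multi-byte UTF-8 sequences) in the middle, so `json.loads()` fails partway through. A minimal sketch of the alternative: strip the `callback(...)` wrapper once from the complete text, and let the server do the splitting by requesting smaller pages via the `rows` and `page` URL parameters instead of `rows=100000000`. The sample string below is hypothetical; the real archive.org response has the same Solr-style shape but far more docs.

```python
import json

def strip_jsonp(text: str) -> str:
    """Remove a JSONP wrapper like callback({...}) and return the bare JSON."""
    start = text.index('(') + 1
    end = text.rindex(')')
    return text[start:end]

# Hypothetical small sample of the wrapped response.
sample = 'callback({"response": {"numFound": 2, "docs": [{"title": "A"}, {"title": "B"}]}})'
data = json.loads(strip_jsonp(sample))
print(len(data["response"]["docs"]))  # 2
```

Requesting e.g. `rows=10000&page=1`, then `page=2`, and so on keeps each download small enough to parse in memory, so no manual splitting of the text is needed.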

Upvotes: 0

Views: 22

Answers (1)

Duk
Duk

Reputation: 25

I got it. I have to treat it as JavaScript:

import json

# Read data from the JavaScript (JSONP) file
with open('data.json', 'r') as file:
    data = file.read()

# Strip the JSONP wrapper and parse the remainder as JSON
json_data = json.loads(data[data.index('{'):data.rindex('}') + 1])

# Save to a new JSON file
with open('tweets.json', 'w') as file:
    json.dump(json_data, file)

Now it's JSON. Got it into my DB.
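Once the wrapper is stripped, the records can go into a database in batches rather than one at a time. A minimal sketch using the stdlib `sqlite3` module, assuming the Solr-style `response.docs` layout of the archive.org output; the `books` table and the inline sample data are hypothetical:

```python
import sqlite3

# Hypothetical cleaned JSON in the shape archive.org's advancedsearch returns.
json_data = {"response": {"docs": [
    {"title": "Book One", "creator": "Alice", "year": "1999"},
    {"title": "Book Two", "creator": "Bob", "year": "2004"},
]}}

conn = sqlite3.connect("books.db")
conn.execute("CREATE TABLE IF NOT EXISTS books (title TEXT, creator TEXT, year TEXT)")

# Insert all docs in one batch; for millions of docs, chunk the
# list and commit once per chunk to keep memory use bounded.
docs = json_data["response"]["docs"]
conn.executemany(
    "INSERT INTO books (title, creator, year) VALUES (?, ?, ?)",
    [(d.get("title"), d.get("creator"), d.get("year")) for d in docs],
)
conn.commit()
conn.close()
```

`executemany` with a parameterized statement also avoids building one giant SQL string and protects against quoting problems in titles.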

Upvotes: 0
