Mark

Reputation: 389

Performance issues formatting using .json()

I am trying to load data from files located at a URL. I use requests to fetch them (this happens plenty fast). However, it then takes about 10 minutes for r.json() to parse each response into a dictionary. How can I speed this up?

match_list = []
for i in range(1, 11):
    r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches%d.json' % i)
    print('matches %d of 10 loaded' % i)
    match_list.append(r.json()['matches'])
    print('list %d of 10 created' % i)
match_histories = {}
match_histories['matches'] = match_list

I know that there is a related question here: Performance problem transforming JSON data, but I don't see how I can apply that to my case. Thanks! (I'm using Python 3.)

Edit:

I have been given quite a few suggestions that seem promising, but with each I hit a roadblock.

Upvotes: 2

Views: 5110

Answers (4)

Kirk Strauser

Reputation: 30937

The built-in JSON parser isn't particularly fast. I tried another parser, python-cjson, like so:

import requests
import cjson

r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
print(cjson.decode(r.content))

The whole program took 3.7 seconds on my laptop, including fetching the data and formatting the output for display.

Edit: Wow, we were all on the wrong track. json isn't slow; the charset detection in Requests is painfully slow. Try this instead:

import requests
import json

r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
r.encoding = 'UTF-8'
print(json.loads(r.text))

The json.loads part takes 1.5s on the same laptop. That's still slower than cjson.decode (at only .62s), but may be fast enough that you won't care if this isn't something you run very frequently. Caveat: I've only benchmarked this on Python 2, and it might be different on Python 3.

Edit 2: It seems cjson doesn't install on Python 3. That's OK: json.loads in this version takes only .54 seconds. Charset detection is still glacial, though, and commenting out the r.encoding = 'UTF-8' line makes the test script run in O(eternal) time again. If you can count on those files always being UTF-8 encoded, I think the performance secret is to put that information in your script so that it doesn't have to be figured out at runtime. With that boost, you don't need to bother supplying your own JSON parser. Just run:

import requests

r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
r.encoding = 'UTF-8'
print(r.json())
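
Applied to the loop from the question, that fix would look something like this (an untested sketch, keeping the question's own progress prints):

import requests

match_list = []
for i in range(1, 11):
    r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches%d.json' % i)
    r.encoding = 'UTF-8'  # skip the slow charset detection
    match_list.append(r.json()['matches'])
    print('matches %d of 10 loaded' % i)
match_histories = {'matches': match_list}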

Upvotes: 8

BrenBarn

Reputation: 251373

It looks like requests uses simplejson to decode the JSON. If you just get the raw data with r.content and then use the built-in Python json library, json.loads(r.content) works very quickly. For invalid JSON it fails by raising an error, but that's better than hanging for a long time.
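
Untested, but that approach might look like the following (decoding the bytes explicitly, since passing bytes straight to json.loads only works on newer Python 3 versions):

import json
import requests

r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
# Decode the raw bytes ourselves to bypass requests' slow charset detection,
# assuming the file is UTF-8 encoded.
data = json.loads(r.content.decode('utf-8'))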

Upvotes: 1

D0r1an

Reputation: 46

Well, that's a pretty big file, and pure-Python code (I suspect the requests library doesn't use C bindings for JSON parsing) is often rather slow. Do you really need all the data? If you only need parts of it, maybe you can find a faster way to extract them, or use a different API if one is available.

You could also try a faster JSON library, such as ujson: https://pypi.python.org/pypi/ujson

I haven't tried this one myself, but it claims to be fast. You can then just call ujson.loads(r.text) to obtain your data.
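
A minimal sketch of that suggestion (the r.encoding line borrows the UTF-8 assumption from the answer above, so that r.text doesn't trigger the slow charset detection):

import requests
import ujson  # pip install ujson

r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json')
r.encoding = 'UTF-8'  # assume UTF-8 so r.text skips charset detection
data = ujson.loads(r.text)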

Upvotes: 0

nivix zixer

Reputation: 1651

I would recommend using a streaming JSON parser (take a look at ijson). A streaming approach will improve memory efficiency for the parsing step, but your program may still be sluggish, since you are storing a rather large dataset in memory.
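
For example, a rough sketch (assuming, as in the question, that each file holds a top-level 'matches' array; process() is a hypothetical per-match handler):

import ijson
import requests

r = requests.get('https://s3-us-west-1.amazonaws.com/riot-api/seed_data/matches1.json',
                 stream=True)
# Stream the 'matches' array one element at a time instead of
# parsing the whole document into memory at once.
for match in ijson.items(r.raw, 'matches.item'):
    process(match)  # hypothetical per-match handler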

Upvotes: 0
