boson

Reputation: 894

Memory issues while parsing json file in ijson

This tutorial https://www.dataquest.io/blog/python-json-tutorial/ works with a 600 MB file. However, when I run their code:

import ijson

filename = "md_traffic.json"
with open(filename, 'r') as f:
    objects = ijson.items(f, 'meta.view.columns.item')
    columns = list(objects)

I end up waiting 10+ minutes for the file to be read into ijson, and I'm really confused how this is supposed to be reasonable. Shouldn't there be parsing? Am I missing something?

Upvotes: 4

Views: 1643

Answers (3)

Rodrigo Tobar

Reputation: 659

The main problem is not that you are creating a list after parsing (that only collects the individual results into a single structure), but that you are using the default pure-Python backend provided by ijson.

There are other backends that are much faster. ijson's homepage explains how to import them. The yajl2_cffi backend is the fastest currently available, but I've created a new yajl2_c backend (there's a pull request pending acceptance) that performs even better.

On my laptop (Intel(R) Core(TM) i7-5600U), your code runs in ~1.5 minutes using the yajl2_cffi backend. Using the yajl2_c backend, it runs in ~10.5 seconds (Python 3) and ~15 seconds (Python 2.7.12).
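As a rough sketch of picking a backend, the snippet below tries the backend modules in order of expected speed and falls back to the next one when an import fails. The helper name pick_backend and the preference order are assumptions made for illustration. It also assumes the backends live under ijson.backends with the names mentioned above, and whether yajl2_c is importable depends on your ijson version and build.

```python
import importlib

def pick_backend(preferred=("yajl2_c", "yajl2_cffi", "yajl2", "python")):
    """Return (name, module) for the first importable ijson backend.

    pick_backend is a hypothetical helper, not part of ijson.
    Returns (None, None) if no backend (or ijson itself) can be imported.
    """
    for name in preferred:
        try:
            module = importlib.import_module("ijson.backends." + name)
        except (ImportError, OSError):
            # Backend not built, yajl library missing, or ijson not installed.
            continue
        return name, module
    return None, None

backend_name, backend = pick_backend()
# If a backend was found, it exposes the same items() function, e.g.:
#     objects = backend.items(f, 'meta.view.columns.item')
```

Printing backend_name before parsing a large file is a cheap way to confirm you are not silently running on the pure-Python backend.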

Edit: @lex-scarisbrick is of course also right in that you can quickly break out of the loop if you are only interested in the column names.

Upvotes: 2

Lex Scarisbrick

Reputation: 1570

This looks like a direct copy/paste of the tutorial found here:

https://www.dataquest.io/blog/python-json-tutorial/

The reason it's taking so long is the list() around the output of the ijson.items function. This effectively forces parsing of the entire file before returning any results. Since ijson.items is a generator, the first result can be returned almost immediately:

import ijson

filename = "md_traffic.json"
with open(filename, 'r') as f:
    for item in ijson.items(f, 'meta.view.columns.item'):
        print(item)
        break

EDIT: The very next step in the tutorial is print(columns[0]), which is why I included printing the first item in the answer. Also, it's not clear whether the question was for Python 2 or 3, so the answer uses syntax that works in both, albeit inelegantly.
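If more than the first item is needed, the same idea extends to taking the first few results without draining the iterator. This is a minimal sketch using itertools.islice from the standard library. The names first_n and counting_source are made up for illustration, and counting_source is only a stand-in for the iterator returned by ijson.items so that the example runs without the 600 MB file.

```python
from itertools import islice

def first_n(items, n):
    """Return the first n values of any iterator without consuming the rest."""
    return list(islice(items, n))

def counting_source(limit, consumed):
    """Stand-in for ijson.items(...): yields dicts and records what was pulled."""
    for i in range(limit):
        consumed.append(i)
        yield {"name": "column_%d" % i}

consumed = []
head = first_n(counting_source(1000000, consumed), 3)
# Only three items were ever produced, even though the source could yield a million.
```

With ijson, the call would look like first_n(ijson.items(f, 'meta.view.columns.item'), 3), assuming f is the open file from the snippet above.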

Upvotes: 1

wind85

Reputation: 487

I tried running your code and killed the program after 25 minutes. So yes, 10 minutes is reasonably fast.

Upvotes: 1
