Hirak Sarkar

Reputation: 519

Opening a large JSON file

I have a 1.7 GB JSON file. When I try to open it with json.load() I get a memory error. How can I read the JSON file in Python?

My JSON file is a big array of objects containing specific keys.

Edit: Of course, if each item in the (outermost) array appears on its own line, then we could read the file one line at a time, along the lines of:

>>> with open('file.json', 'r') as f:
...     for line in f:
...         do_something_with(line)

Upvotes: 17

Views: 10252

Answers (5)

peak

Reputation: 116750

There's a CLI wrapper around ijson that I created precisely for ease of processing very large JSON documents.

In your case you could simply pipe the "big array of objects" to jm.py, and it will emit each top-level object on a separate line for piping into another process.

jm.py has various options which you might also find relevant.

The same repository has a similar script, jm, which I mention because it is typically significantly faster, though it is PHP-based.
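
For instance, once each top-level object is on its own line, a small downstream script can consume the stream one object at a time. This is only a sketch: the exact jm.py invocation is documented in its repository, and process_object() is a placeholder for your own logic.

# downstream.py -- e.g.: jm.py file.json | python downstream.py
# (check jm.py's documentation for the exact invocation)
import json
import sys

def process_object(obj):
    # placeholder: do whatever you need with one parsed object
    print(obj.get('some_key'))

for line in sys.stdin:
    process_object(json.loads(line))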

Upvotes: 0

kilozulu

Reputation: 347

I've used Dask for large telemetry JSON-Lines files (newline delimited)...
The nice thing about Dask is that it does a lot of the work for you.
With it, you can read the data, process it, and write to disk without reading it all into memory.
Dask will also parallelize for you and use multiple cores (threads)...

More info on Dask bags here:
https://examples.dask.org/bag.html

import ujson as json  # ujson for speed and for NaN handling, which the JSON spec does not cover
import dask.bag as db

def update_dict(d):
    # example per-record transformation: add some keys and derive 'c'
    d.update({'new_key': 'new_value', 'a': 1, 'b': 2, 'c': 0})
    d['c'] = d['a'] + d['b']
    return d

def read_jsonl(filepaths):
    """Reads JSON-Lines files into a Dask Bag.

    :param filepaths: list of filepath strings OR a string with a wildcard
    :returns: a Dask Bag of dictionaries, each dict one parsed JSON object
    """
    return db.read_text(filepaths).map(json.loads)


filepaths = ['file1.jsonl.gz', 'file2.jsonl.gz']
# OR
filepaths = 'file*.jsonl.gz'  # wildcard to match multiple files

# (optional) if you want Dask to use multiple processes instead of threads
# from dask.distributed import Client, progress
# client = Client(threads_per_worker=1, n_workers=6)  # 6 workers for 6 cores
# print(client)

# define the bag containing our data, with the JSON parser applied per line
dask_bag = read_jsonl(filepaths)

# modify our data
# note: this is lazy -- it only adds to the task graph; reassign to keep the mapped bag
dask_bag = dask_bag.map(update_dict)

# (optional) if you're only reading one huge file but want to split the data into
# multiple output files, you can repartition the bag
# dask_bag = dask_bag.repartition(10)

# write our modified data back to disk; this is when Dask actually executes
dask_bag.map(json.dumps).to_textfiles('file_mod*.jsonl.gz')  # Dask applies compression automatically for .gz

Upvotes: 1

Yaroslav Stavnichiy

Reputation: 21446

I have found another Python wrapper around the yajl library: ijson.

It works better for me than yajl-py for the following reasons:

  • yajl-py did not detect the yajl library on my system; I had to hack the code to make it work
  • ijson's code is more compact and easier to use
  • ijson can work with both yajl v1 and yajl v2, and it even has a pure Python yajl replacement
  • ijson has a very nice ObjectBuilder, which helps you extract not just events but meaningful sub-objects from the parsed stream, at whatever level you specify (see the sketch below)
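
For instance, for the "big array of objects" described in the question, a minimal sketch using ijson's items() helper (which assembles each element at the prefix you give it into a Python object) might look like this; 'file.json' and process() are placeholders:

import ijson

with open('file.json', 'rb') as f:
    # 'item' is the prefix of each element of the top-level array
    for obj in ijson.items(f, 'item'):
        process(obj)  # placeholder: handle one object at a time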

Upvotes: 5

Croad Langshan

Reputation: 2800

I found yajl (and hence ijson) to be much slower than the built-in json module when reading a large data file from local disk. Here is a module that claims to perform better than yajl/ijson when used with Cython (though still slower than json):

http://pietrobattiston.it/jsaone

As the author points out, performance may be better than json when the file is received over the network since an incremental parser can start parsing sooner.

Upvotes: 0

georg

Reputation: 214969

You want an incremental JSON parser like yajl, together with one of its Python bindings. An incremental parser reads as little as possible from the input and invokes a callback whenever something meaningful is decoded. For example, to pull only the numbers out of a big JSON file:

from yajl import YajlContentHandler, YajlParser  # yajl-py bindings

list_of_numbers = []

class ContentHandler(YajlContentHandler):
    def yajl_number(self, ctx, val):
        # called once for every number decoded from the stream
        list_of_numbers.append(float(val))

parser = YajlParser(ContentHandler())
parser.parse(some_file)  # some_file: an open file object

See http://pykler.github.com/yajl-py/ for more info.

Upvotes: 15
