Smajjk

Reputation: 394

Read large JSON file on disk

I have a large JSON file, db.json (> 100 MB), with the following content:

{"sitters": [["9919.html", 3, 8, 19, 47, 120, 129, 359]], "yellow": [["9945.html", 791], 
["9983.html", 1496], ["9984.html", 151]], "four": [["9971.html", 81, 403], ["9991.html", 37], 
["9995.html", 45, 225, 337], ["9975.html", 15], ["9978.html", 100], ["9948.html", 381], 
["9966.html", 228], ...

where the keys are words and each value is a list of entries, each consisting of a filename followed by the positions where the word appears in that file. I would like to look up n words in this JSON file and retrieve their corresponding filenames and positions. Any idea how to do this efficiently given the large file size? I have been looking at ijson but I can't seem to get it to work. I have tried:

parser = parse("db.json")                                                             
for prefix, event, value in parser:                                                  
    if event == 'sitters':                                                           
        print value   

But I might not understand how to use it properly because it gives me the following error:

Traceback (most recent call last):
  File "retriever.py", line 43, in <module>
    sys.exit(main())
  File "retriever.py", line 38, in main
    for prefix, event, value in parser:
  File "/usr/local/lib/python2.7/dist-packages/ijson/common.py", line 63, in parse
    for event, value in basic_events:
  File "/usr/local/lib/python2.7/dist-packages/ijson/backends/yajl2.py", line 90, in basic_parse
    buffer = f.read(buf_size)
AttributeError: 'str' object has no attribute 'read'

Any help is highly appreciated!

Upvotes: 1

Views: 894

Answers (1)

tamasgal

Reputation: 26259

You're passing the string 'db.json' to parse rather than an open file object in this line:

parser = parse("db.json")                                                             

As you can see in the error message, the line buffer = f.read(buf_size) throws this exception:

AttributeError: 'str' object has no attribute 'read'

The parse function expects a file object (anything with a read method), so open the file first:

f = open('db.json', 'r')
parser = parse(f)

and close it once your work is done:

f.close()

You can also handle the open and close process using the with statement:

with open('db.json') as f:
    parser = parse(f)
    # use the parser here; the file is closed automatically when this block ends
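
Once the file is opened properly, the loop from the question will still print nothing: in the tuples that parse yields, event holds the token type (for example 'map_key', 'start_array', 'string' or 'number'), while the path to the current value is in prefix, e.g. 'sitters.item.item' for the values inside each inner list. Below is a minimal sketch of how the lookup might be done on top of that. The names collect_postings and query_file are made up for illustration, and the sketch assumes the words themselves contain no '.' (since ijson joins path components with dots). It still reads the whole file once, but only keeps the requested words in memory.

```python
def collect_postings(events, wanted):
    """Collect [filename, pos, pos, ...] lists for the wanted words.

    `events` is any iterable of (prefix, event, value) tuples in the
    shape produced by ijson.parse. Assumes words contain no '.'.
    """
    wanted = set(wanted)
    found = {}
    for prefix, event, value in events:
        # prefix looks like 'word', 'word.item' or 'word.item.item'
        word, _, rest = prefix.partition('.')
        if word not in wanted:
            continue
        if rest == 'item' and event == 'start_array':
            # a new [filename, positions...] entry begins
            found.setdefault(word, []).append([])
        elif rest == 'item.item':
            # filename (string) or position (number) inside the entry
            found[word][-1].append(value)
    return found


def query_file(path, wanted):
    """Hypothetical helper: stream `path` with ijson and look up words."""
    import ijson  # third-party; imported lazily so the logic above stands alone
    with open(path, 'rb') as f:  # ijson recommends binary mode
        return collect_postings(ijson.parse(f), wanted)
```

With the data from the question, query_file('db.json', ['sitters', 'yellow']) should return a dict mapping each of the two words to its list of entries. Words that do not occur in the file are simply absent from the result.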

Upvotes: 4
