user3476463

Reputation: 4575

Speed up loading JSON data into a DataFrame

I'm trying to read a very large set of nested JSON files into a pandas DataFrame, using the code below. It's a few million records; it's the "review" file from the Yelp academic dataset.

Does anyone know a quicker way to do this?

Is it possible to just load a sample of the JSON records? I would probably be fine with just a couple hundred thousand records.

Also, I probably don't need all the fields from the review.json file. Could I just load a subset of them, like user_id, business_id, and stars? And would that speed things up?

I would post sample data but I can't even get it to finish loading.

Code:

import pandas as pd

df_review = pd.read_json('dataset/review.json', lines=True)

Update:

Code:

reviews = ''

with open('dataset/review.json','r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line

testdf = pd.read_json(reviews,lines=True)

Error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-18-8e4a45990905> in <module>()
      5         reviews += line
      6 
----> 7 testdf = pd.read_json(reviews,lines=True)

/Users/anaconda/lib/python2.7/site-packages/pandas/io/json.pyc in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines)
    273         # commas and put it in a json list to make a valid json object.
    274         lines = list(StringIO(json.strip()))
--> 275         json = u'[' + u','.join(lines) + u']'
    276 
    277     obj = None

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 357: ordinal not in range(128)

Update 2:

import sys
import pandas as pd

# Python 2 hack: reload sys to restore setdefaultencoding, then force
# UTF-8 so read_json no longer falls back to the ASCII codec.
reload(sys)
sys.setdefaultencoding('utf-8')

reviews = ''

with open('dataset/review.json', 'r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line

testdf = pd.read_json(reviews, lines=True)
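An alternative that avoids changing the interpreter-wide default encoding (a sketch, assuming the file is UTF-8, which the 0xc3 byte in the traceback suggests) is to decode the file explicitly with io.open:

import io
import pandas as pd

# Decode each line as UTF-8 up front, so read_json receives unicode
# and the ASCII codec is never involved.
with io.open('dataset/review.json', 'r', encoding='utf-8') as f:
    reviews = ''.join(f.readlines()[0:1000])

testdf = pd.read_json(reviews, lines=True)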

Upvotes: 6

Views: 6484

Answers (3)

shouldsee

Reputation: 484

I agree with @Nathan H's suggestion, but the real gain probably lies in parallelization.

import multiprocessing as mp
import pandas as pd

chunk_size = 1000

with open('dataset/review.json', 'r') as f:
    lines = f.readlines()

# Split the file into buffers of chunk_size lines each.
buf_lst = [''.join(lines[x:x + chunk_size]) for x in range(0, len(lines), chunk_size)]

def f(buf):
    return pd.read_json(buf, lines=True)

#### single-process
df_lst = list(map(f, buf_lst))

#### multi-process
pool = mp.Pool(4)
df_lst = pool.map(f, buf_lst)
pool.close()  # close() must come before join()
pool.join()

However, I am not sure how to combine the resulting DataFrames yet.
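For what it's worth, a minimal sketch of that missing step: pandas can stack the list of per-chunk DataFrames into a single one.

# Concatenate the per-chunk DataFrames, renumbering the index.
df_review = pd.concat(df_lst, ignore_index=True)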

Upvotes: 1

Nathan H

Reputation: 346

If your file contains line-separated JSON objects, as you imply, this should work: read just the first 1000 lines of the file, then parse them with pandas.

import pandas as pd

reviews = ''

# Sample only the first 1000 lines of the file.
with open('dataset/review.json', 'r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line

pd.read_json(reviews, lines=True)
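If your pandas version supports it (chunksize for line-delimited JSON was added in pandas 0.21), you can also let read_json do the sampling itself; a sketch:

import pandas as pd

# Stream the file in chunks of 1000 records and keep only the first chunk.
reader = pd.read_json('dataset/review.json', lines=True, chunksize=1000)
sample = next(iter(reader))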

Upvotes: 2

Or Duan

Reputation: 13810

Speeding up that one line is challenging, because pandas' JSON reader is already heavily optimized.

I would first check whether you can get fewer rows or less data from the provider, as you mentioned.

If you can preprocess the data, I would recommend parsing it with a standalone JSON parser first (even try different parsers, since their performance varies with each dataset's structure), saving just the fields you need, and then calling the pandas method on that output.

Here you can find a benchmark of JSON parsers; keep in mind that you should test on your own data, as that article is from 2015.
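A minimal sketch of that preprocessing step, assuming the standard-library json module and the three fields mentioned in the question (user_id, business_id, stars):

import json
import pandas as pd

keep = ('user_id', 'business_id', 'stars')
records = []

with open('dataset/review.json', 'r') as f:
    for line in f:
        obj = json.loads(line)
        # Keep only the needed fields, dropping the heavy review text.
        records.append({k: obj[k] for k in keep})

df_review = pd.DataFrame(records)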

Upvotes: 0
