Reputation: 4575
I'm trying to read a very large file of nested JSON records into a pandas dataframe, using the code below. It's a few million records; it's the "review" file from the Yelp academic dataset.
Does anyone know a quicker way to do this?
Is it possible to just load a sample of the JSON records? I would probably be fine with just a couple hundred thousand records.
Also, I probably don't need all the fields from the review.json file. Could I just load a subset of them, like user_id, business_id, and stars? And would that speed things up?
I would post sample data, but I can't even get it to finish loading.
Code:
df_review = pd.read_json('dataset/review.json', lines=True)
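As an aside, a rough sketch of what a sampled, column-subset load could look like (this assumes a pandas version recent enough that read_json accepts chunksize together with lines=True, and uses the column names mentioned above):
import pandas as pd

# Read the line-delimited file in chunks rather than all at once,
# keep only a few columns, and stop after roughly 200k records.
chunks = []
total = 0
for chunk in pd.read_json('dataset/review.json', lines=True, chunksize=100000):
    chunks.append(chunk[['user_id', 'business_id', 'stars']])
    total += len(chunk)
    if total >= 200000:
        break
df_sample = pd.concat(chunks, ignore_index=True)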
Update:
Code:
reviews = ''
with open('dataset/review.json', 'r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line
testdf = pd.read_json(reviews, lines=True)
Error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-18-8e4a45990905> in <module>()
5 reviews += line
6
----> 7 testdf = pd.read_json(reviews,lines=True)
/Users/anaconda/lib/python2.7/site-packages/pandas/io/json.pyc in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines)
273 # commas and put it in a json list to make a valid json object.
274 lines = list(StringIO(json.strip()))
--> 275 json = u'[' + u','.join(lines) + u']'
276
277 obj = None
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 357: ordinal not in range(128)
Update 2:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

reviews = ''
with open('dataset/review.json', 'r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line
testdf = pd.read_json(reviews, lines=True)
Upvotes: 6
Views: 6484
Reputation: 484
I agree with @Nathan H's suggestion, but the key point will probably lie in parallelization.
import pandas as pd

buf_lst = []
df_lst = []
chunk_size = 1000

# Split the file into buffers of `chunk_size` lines, each of which is
# itself a valid lines-delimited JSON string.
with open('dataset/review.json', 'r') as f:
    lines = f.readlines()
    buf_lst += [''.join(lines[x:x+chunk_size]) for x in range(0, len(lines), chunk_size)]

def f(buf):
    return pd.read_json(buf, lines=True)

#### single-thread
df_lst = map(f, buf_lst)

#### multi-thread
import multiprocessing as mp
pool = mp.Pool(4)
df_lst = pool.map(f, buf_lst)
pool.close()   # no more tasks will be submitted
pool.join()    # wait for the workers to finish
However, I am not sure yet how to combine the resulting dataframes.
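A minimal sketch of that last step (not part of the original answer): the list of per-chunk frames can be concatenated with pd.concat.
import pandas as pd

# Concatenate the per-chunk dataframes into a single frame,
# resetting the index so rows are numbered 0..N-1.
df_review = pd.concat(df_lst, ignore_index=True)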
Upvotes: 1
Reputation: 346
If your file has line-separated JSON objects as you imply, this might work: just read the first 1000 lines of the file and then parse them with pandas.
import pandas as pd

reviews = ''
with open('dataset/review.json', 'r') as f:
    for line in f.readlines()[0:1000]:
        reviews += line
pd.read_json(reviews, lines=True)
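Since the traceback in the question shows a UnicodeDecodeError under Python 2, one variant worth trying (a sketch, not part of the original answer) is to decode the file explicitly with io.open so pandas receives unicode text:
import io
import pandas as pd

reviews = u''
# Decode the bytes as UTF-8 up front so that pandas' internal string
# handling never falls back to ASCII.
with io.open('dataset/review.json', 'r', encoding='utf-8') as f:
    for line in f.readlines()[0:1000]:
        reviews += line
testdf = pd.read_json(reviews, lines=True)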
Upvotes: 2
Reputation: 13810
Speeding up that one line would be challenging, because it's already highly optimized.
I would first check whether you can get fewer rows/less data from the provider, as you mentioned.
If you can preprocess the data, I would recommend parsing it with a standalone JSON parser first (even try different parsers, since their performance varies with each dataset's structure), saving just the fields you need, and then calling the pandas method on that output.
Here you can find some benchmarks of JSON parsers; keep in mind that you should test on your own data, and that this article is from 2015.
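A minimal sketch of that preprocessing idea, using the standard json module and keeping only the fields mentioned in the question (user_id, business_id, stars); a faster parser such as ujson could be swapped in the same way:
import json
import pandas as pd

records = []
with open('dataset/review.json', 'r') as f:
    for line in f:
        obj = json.loads(line)
        # keep only the fields that are actually needed
        records.append({'user_id': obj['user_id'],
                        'business_id': obj['business_id'],
                        'stars': obj['stars']})
df_review = pd.DataFrame(records)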
Upvotes: 0