Reputation: 143
I'm dealing with a really large JSON file (6.5 GB); on my local machine it's impossible to read it all at once. So I want to read a chunk as a test sample and write code against this sample before running it on the entire dataset.
import pandas as pd
file_dir = 'D://yelp_dataset/yelp_academic_dataset_review.json'
df_review_sample = pd.read_json(file_dir, lines=True, chunksize=1000)
I made the attempt above, but df_review_sample
becomes a JsonReader object instead of a DataFrame.
Is there a way to show the first chunk as a dataframe?
Upvotes: 2
Views: 5362
Reputation: 185
I ran into the same issue yesterday afternoon, and I finally understood what's going on.
Passing the args lines=True and chunksize=X creates a reader that yields the file a fixed number of lines at a time.
You then have to loop over it to get each chunk.
Here is a piece of code to illustrate:
import pandas as pd
import json
chunks = pd.read_json('../input/data.json', lines=True, chunksize = 10000)
for chunk in chunks:
    print(chunk)
    break
The reader splits your JSON into multiple chunks according to its length in lines. For example, if I have a 100,000-line JSON file and set chunksize = 10000, I get 10 chunks.
In the code above I added a break so that only the first chunk is printed, but if you remove it, you will get all 10 chunks one by one.
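If you only need the first chunk as a DataFrame (which is what the question asks for), you can also call next() on the reader instead of looping, since a JsonReader is an iterator. A minimal sketch, using a tiny in-memory sample standing in for the real file:

```python
import pandas as pd
from io import StringIO

# A small line-delimited JSON sample standing in for the real 6.5 GB file.
json_lines = '\n'.join(
    '{"stars": %d, "text": "review %d"}' % (i % 5 + 1, i) for i in range(25)
)

# chunksize turns read_json into a JsonReader (an iterator of DataFrames);
# next() pulls just the first chunk without reading the rest of the file.
reader = pd.read_json(StringIO(json_lines), lines=True, chunksize=10)
df_first_chunk = next(reader)
print(df_first_chunk.head())
```

With the real file you would pass the path instead of the StringIO object; next(reader) gives you a regular DataFrame of chunksize rows to prototype on.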
Upvotes: 2