dhs4402

Reputation: 143

Read Large Json in Python and take a slice as a sample

I'm dealing with a really large JSON file (6.5 GB); on my local machine, it's impossible to read it all at once. So I want to read a chunk as a test sample and write code based on this sample before running on the entire dataset.

import pandas as pd


file_dir = 'D://yelp_dataset/yelp_academic_dataset_review.json'

df_review_sample = pd.read_json(file_dir, lines=True, chunksize=1000)

I made the attempt above, but df_review_sample becomes a JsonReader object. Is there a way to show the first chunk as a DataFrame?

Upvotes: 2

Views: 5362

Answers (1)

Max

Reputation: 185

I ran into the same issue yesterday afternoon, and I finally understood what's going on.

Using the arguments lines=True and chunksize=X creates a reader that yields a specific number of lines at a time.

Then you have to loop over the reader to get each chunk.

Here is a piece of code to illustrate:

import pandas as pd

chunks = pd.read_json('../input/data.json', lines=True, chunksize=10000)
for chunk in chunks:
    print(chunk)  # each chunk is a regular DataFrame
    break         # stop after the first chunk

The reader produces a number of chunks according to the length of your JSON file (counted in lines). For example, if I have a 100,000-line JSON file and set chunksize=10000, I will get 10 chunks.

In the code above I added a break so that only the first chunk is printed, but if you remove it, you will get all 10 chunks one by one.
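If you only want the first chunk as a DataFrame (as the question asks), you can also call next() on the reader instead of looping, since JsonReader is an iterator. Here is a minimal self-contained sketch; it generates a small sample file (sample.json is a stand-in for the real dataset path):

```python
import json
import pandas as pd

# Generate a tiny line-delimited JSON file (stand-in for the real dataset).
with open("sample.json", "w") as f:
    for i in range(100):
        f.write(json.dumps({"review_id": i, "stars": i % 5 + 1}) + "\n")

# read_json with chunksize returns a JsonReader that yields DataFrames.
reader = pd.read_json("sample.json", lines=True, chunksize=10)

# next() pulls only the first chunk; the rest of the file is not loaded yet.
first_chunk = next(reader)
print(type(first_chunk).__name__)  # DataFrame
print(len(first_chunk))            # 10

# The remaining chunks are still available: 100 lines / 10 per chunk = 10 total.
remaining = sum(1 for _ in reader)
print(remaining + 1)               # 10
```

This way df_review_sample from the question would be a real DataFrame holding just the first 1000 lines.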

Upvotes: 2

Related Questions