sebach1

Reputation: 117

Reading big json dataset using pandas with chunks

I want to read a 6 GB JSON file (and I have another of 1.5 GB). I tried reading it normally with pandas (just pd.read_json), and memory clearly dies. Then I tried the chunksize parameter, like:

import pandas as pd

with open('data/products.json', encoding='utf-8') as f:
    df = []
    df_reader = pd.read_json(f, lines=True, chunksize=1000000)
    for chunk in df_reader:
        df.append(chunk)
data = pd.concat(df)  # combine the chunks into a single DataFrame

But that doesn't work either, and my PC dies within the first minute of running (I have 8 GB of RAM).

Upvotes: 2

Views: 4777

Answers (1)

Charles Landau

Reputation: 4265

Dask and PySpark have dataframe solutions that are nearly identical to pandas. PySpark is a Spark API and distributes workloads across JVMs. Dask specifically targets the out-of-memory-on-a-single-workstation use case and implements the dataframe API.

As shown here, read_json's API mostly passes through from pandas.

As you port your example code from the question, I would note two things:

  1. I suspect you won't need the file context manager, as simply passing the file path probably works.

  2. If you have multiple files, Dask supports glob patterns like "path/to/files/*.json" (see the sketch after this list).
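As a minimal sketch of both points with Dask, assuming your data is line-delimited JSON at the path from your question (the blocksize value is only an illustrative guess to keep partitions small):

import dask.dataframe as dd

# Pass the path directly; no context manager needed.
# blocksize splits the line-delimited JSON into partitions that fit in
# memory (the value here is illustrative; tune it to your RAM).
df = dd.read_json('data/products.json', lines=True, blocksize=2**28)

# With multiple files, a glob pattern also works:
# df = dd.read_json('data/products/*.json', lines=True)

# Work stays lazy until you ask for results, e.g.:
print(df.head())

Nothing is loaded until you ask for a result (.head() computes only the first partition, .compute() materializes the whole thing), so memory use is bounded by partition size rather than file size.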

Upvotes: 1
