Reputation: 117
I want to read a JSON file of 6 GB (and I have another of 1.5 GB). I tried to read it normally with pandas (just with pd.read_json), and memory clearly dies. Then I tried with the chunksize param, like:
import pandas as pd

with open('data/products.json', encoding='utf-8') as f:
    df = []
    df_reader = pd.read_json(f, lines=True, chunksize=1000000)
    for chunk in df_reader:
        df.append(chunk)
data = pd.concat(df)  # concatenate the chunks (pd.read_json(df) was a bug)
But that doesn't work either, and my PC dies within the first minute of running (I have 8 GB of RAM).
Upvotes: 2
Views: 4777
Reputation: 4265
Dask and PySpark have dataframe solutions that are nearly identical to pandas. PySpark is a Spark API and distributes workloads across JVMs. Dask specifically targets the out-of-memory-on-a-single-workstation use case and implements the dataframe API.
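For the PySpark route, a minimal sketch might look like the following, assuming a local Spark installation and reusing the data/products.json path from the question; spark.read.json expects line-delimited JSON by default, matching lines=True in pandas:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("products").getOrCreate()

# Reads line-delimited JSON by default; work stays distributed and lazy.
sdf = spark.read.json("data/products.json")
sdf.printSchema()

# Pull only a bounded sample back into driver memory.
preview = sdf.limit(1000).toPandas()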
As shown here, read_json's API mostly passes through from pandas.
As you port your example code from the question, I would note two things:

- I suspect you won't need the file context manager, as simply passing the file path probably works.
- If you have multiple files, Dask supports globs like "path/to/files/*.json" (see the sketch below).
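To make that concrete, here is a minimal sketch of the Dask version; the blocksize value is just an illustrative choice, not something your data requires:

import dask.dataframe as dd

# Pass the path directly -- no open() or context manager needed.
# Dask reads the file in blocks, so the full 6 GB never sits in RAM at once.
df = dd.read_json("data/products.json", lines=True, blocksize=2**28)  # ~256 MB blocks

# Operations are lazy; head() only materialises the first rows.
print(df.head())

# With multiple files, a glob works too:
# df = dd.read_json("path/to/files/*.json", lines=True)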
Upvotes: 1