Reputation: 1428
I have a rather large parquet file (~1.35 GB) that I'm trying to read. I'm using Google Colab Pro, which gives me 25 GB of RAM. I ran the following code:
import dask.dataframe as dd
data = dd.read_parquet(DATA_DIR / 'train.parquet', chunksize=100)
data.head()
And ran out of memory. Is there something I can do to reduce the memory consumption? I tried different chunk sizes, as well as removing the argument entirely, but every attempt ran out of memory.
Upvotes: 1
Views: 495
Reputation: 16551
The docs warn that chunksize will be deprecated. Moreover, the value you provided is very small (it is interpreted as a number of bytes), which will result in far too many partitions. Without a reproducible example it's hard to be more specific, but I would recommend using the default settings:
from dask.dataframe import read_parquet
data = read_parquet(DATA_DIR / 'train.parquet')
data.head() # hopefully works
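As a quick sanity check (just a sketch, assuming the read_parquet call above succeeds), you can look at how many partitions dask created before calling head(); with the defaults this should be a modest number for a single ~1.35 GB file:
print(data.npartitions)  # very large counts suggest the file is being over-split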
As suggested by @MichaelDelgado, if the parquet file contains more than one row group, you can pass split_row_groups=True to get smaller partitions:
from dask.dataframe import read_parquet
data = read_parquet(DATA_DIR / 'train.parquet', split_row_groups=True)
data.head() # hopefully works
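Whether split_row_groups=True helps depends on how the file was written. A minimal way to check, assuming pyarrow is installed (dask's default parquet engine) and that train.parquet is a single file rather than a directory of part files:
import pyarrow.parquet as pq
pf = pq.ParquetFile(DATA_DIR / 'train.parquet')
print(pf.num_row_groups)  # if this is 1, split_row_groups=True cannot produce smaller partitions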
Upvotes: 1