João Areias

Reputation: 1428

Dask using too much memory when reading parquet

I have a rather large parquet file (~1.35 GB) that I'm trying to read. I'm using Google Colab Pro, which gives me 25 GB of RAM. I ran the following code:

import dask.dataframe as dd
data = dd.read_parquet(DATA_DIR / 'train.parquet', chunksize=100)
data.head()

It ran out of memory. Is there something I can do to reduce the memory consumption? I tried different chunk sizes, as well as removing the argument entirely, but every attempt runs out of memory.

Upvotes: 1

Views: 495

Answers (1)

SultanOrazbayev

Reputation: 16551

The docs warn that chunksize will be deprecated. Moreover, the value you provided is rather small (it is interpreted as a number of bytes), which will result in too many partitions. Without a reproducible example it's hard to be more specific, but I would recommend using the default settings:

from dask.dataframe import read_parquet
data = read_parquet(DATA_DIR / 'train.parquet')
data.head()  # hopefully works
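
To confirm that the defaults produce a sensible number of chunks, you can inspect the partition count before computing anything. A minimal sketch, assuming data is the dataframe from the snippet above:

from dask.dataframe import read_parquet

data = read_parquet(DATA_DIR / 'train.parquet')

# how many partitions Dask created; a very small chunksize (in bytes)
# inflates this number dramatically, which adds a lot of overhead
print(data.npartitions)

# number of rows per partition, evaluated lazily and then computed
print(data.map_partitions(len).compute())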

As suggested by @MichaelDelgado, if the parquet file contains more than one row group, you can pass split_row_groups=True to get smaller partitions:

from dask.dataframe import read_parquet
data = read_parquet(DATA_DIR / 'train.parquet', split_row_groups=True)
data.head()  # hopefully works
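
Whether split_row_groups=True helps depends on how the file was written, so it's worth checking the row-group layout first. A small sketch using pyarrow (assuming it is installed, which it usually is alongside dask's parquet support):

import pyarrow.parquet as pq

pf = pq.ParquetFile(DATA_DIR / 'train.parquet')

# if this is 1, the whole file is a single row group and
# split_row_groups=True cannot produce smaller partitions
print(pf.num_row_groups)

# uncompressed size of the first row group, to gauge partition size
print(pf.metadata.row_group(0).total_byte_size)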

Upvotes: 1
