Woooooooo

Reputation: 85

Dask out-of-memory error even though the data size doesn't exceed memory

I'm trying to load a Dask dataframe from a MySQL table that takes about 4 GB of space on disk. I'm using a single machine with 8 GB of memory, but as soon as I drop duplicates and try to get the length of the dataframe, I hit an out-of-memory error.

Here's a snippet of my code:

import dask.dataframe as dd
from sqlalchemy import sql

df = dd.read_sql_table("testtable", db_uri, npartitions=8, index_col=sql.func.abs(sql.column("id")).label("abs(id)"))
df = df[['gene_id', 'genome_id']].drop_duplicates()
print(len(df))

I have tried more partitions for the dataframe (as many as 64), but they also failed. I'm confused about why this causes an OOM: the dataframe should fit in memory even without any parallel processing.

Upvotes: 1

Views: 666

Answers (2)

Powers

Reputation: 19308

You don't want to read an entire table into a Dask DataFrame and then perform filtering in Dask. It's better to perform the filtering at the database level and then read only a small subset of the data into a Dask DataFrame.

MySQL can select just the columns you need and drop duplicates with DISTINCT. The resulting, much smaller result set is what you should read into the Dask DataFrame, as in the sketch below.
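Here's a minimal sketch of that approach, assuming SQLAlchemy 1.4+ and a recent Dask version that provides dd.read_sql_query; the table and column names come from the question, and the choice of gene_id as the index column is only an illustration (it works best if that column is numeric and indexed; otherwise pass explicit divisions):

import dask.dataframe as dd
import sqlalchemy as sa

# DISTINCT pushes the de-duplication into MySQL, so only unique
# (gene_id, genome_id) pairs are transferred to Dask.
query = (
    sa.select(sa.column("gene_id"), sa.column("genome_id"))
    .distinct()
    .select_from(sa.table("testtable"))
)

# index_col must be one of the columns selected by the query.
df = dd.read_sql_query(query, db_uri, index_col="gene_id", npartitions=8)
print(len(df))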

See the Dask documentation on read_sql_query for more information on the syntax. It's easiest to query databases that have official connectors, like dask-snowflake.

Upvotes: 0

mdurant

Reputation: 28673

which takes about 4gb space on disk

It is very likely to be much, much bigger than this in memory. Disk storage is optimised for compactness, using various encoding and compression mechanisms.

The dataframe should fit in memory

So, have you measured its size as a single pandas dataframe?
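For example (a sketch, using the table name and URI from the question), you can load the data with plain pandas and check its true in-memory footprint; if even this step runs out of memory, that already answers the question:

import pandas as pd
import sqlalchemy as sa

# Load the full table with plain pandas and measure its real size in memory.
engine = sa.create_engine(db_uri)
pdf = pd.read_sql_table("testtable", engine)

# deep=True accounts for the actual bytes used by object (string) columns.
print(pdf.memory_usage(deep=True).sum() / 1e9, "GB")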

You should also keep in mind that any processing you do to your data often involves making temporary copies within functions. For example, you can only drop duplicates by first finding the duplicates, which must happen before you can discard any data.

Finally, in a parallel framework like Dask, there may be multiple threads and processes (you don't specify how you are running Dask) which need to marshal their work and assemble the final output, while the client and scheduler also take up some memory. In short, you need to measure your situation and perhaps tweak worker config options.
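As a sketch of that last point, with the distributed scheduler you can cap per-worker memory explicitly; the numbers here are illustrative for an 8 GB machine, not recommendations:

from dask.distributed import Client, LocalCluster

# Two worker processes, each limited to ~3 GB, leaving headroom for the
# client and scheduler.
cluster = LocalCluster(n_workers=2, threads_per_worker=2, memory_limit="3GB")
client = Client(cluster)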

Upvotes: 1
