Ryan McCormick

Reputation: 231

Purpose of compute() in Dask

What is the reasoning behind the extra .compute() call in Dask's NumPy- and pandas-like interfaces? Is it just to support some kind of lazy evaluation?

Example from Dask documentation below:

# pandas
import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

# dask.dataframe
import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()

Upvotes: 5

Views: 4260

Answers (1)

MRocklin

Reputation: 57271

Yes, your intuition is correct here. Most Dask collections (array, bag, dataframe, delayed) are lazy by default. Normal operations only build up a description of the work, while calling compute() actually triggers execution.
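To make the "describe first, run later" idea concrete, here is a toy sketch in plain Python. It is not how Dask is implemented; the `Lazy` class and the `load` and `mean` helpers are hypothetical names made up for illustration. It only shows that building the expression does no work and that compute() is what runs it.

```python
class Lazy:
    """Toy deferred value: stores a function and its inputs, runs nothing yet."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Evaluate any lazy inputs first (recursively), then this node.
        values = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*values)


calls = []  # records when real work actually happens


def load():
    # Hypothetical stand-in for reading data.
    calls.append("load")
    return [1, 2, 3, 4]


def mean(xs):
    calls.append("mean")
    return sum(xs) / len(xs)


result = Lazy(mean, Lazy(load))  # only builds the recipe
calls_before = list(calls)       # snapshot: nothing has run so far
value = result.compute()         # now load() and mean() execute
```

Until compute() is called, `calls` stays empty. That gap between describing and executing is what gives a library like Dask room to inspect the whole expression before doing any work.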

This is important both so that we can do a bit of light optimization and so that we can support low-memory execution. For example, suppose you were to call

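import dask.array as da

x = da.ones(1000000000000)  # lazy: no memory is allocated for the elements yet
total = x.sum()             # still lazy: only describes the reduction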

If we ran each operation immediately, we would have to assume that you wanted the full array x computed, which would be unfortunate if you were on a single machine (a trillion float64 values is about 8 TB). But because we know that you only want total.compute(), we can compute the result in much smaller memory.
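The memory argument can be sketched in plain Python as well. This is a simplified model under stated assumptions, not Dask's actual scheduler: `lazy_ones` and `chunked_sum` are hypothetical helpers, and the sizes are shrunk so the example runs quickly. The point is that when only the sum is requested, each chunk can be created, reduced, and discarded in turn.

```python
def lazy_ones(n, chunk_size):
    """Yield chunks of ones one at a time instead of building the full array."""
    remaining = n
    while remaining > 0:
        size = min(chunk_size, remaining)
        yield [1.0] * size
        remaining -= size


def chunked_sum(chunks):
    """Sum all chunks, tracking the largest chunk held in memory at once."""
    total = 0.0
    peak = 0
    for chunk in chunks:
        peak = max(peak, len(chunk))
        total += sum(chunk)  # the chunk can be freed after this line
    return total, peak


# One million elements processed in chunks of ten thousand.
total, peak = chunked_sum(lazy_ones(1_000_000, 10_000))
```

Here `peak` never exceeds the chunk size, even though a million elements are summed overall. An eager version that built the whole list first would need memory for all of them at once.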

Upvotes: 2
