Reputation: 231
What're the logistics behind having the extra .compute()
in the numpy
and pandas
mimicked functionalities? Is it just to support some kind of lazy evaluation?
Example from Dask documentation below:
import pandas as pd import dask.dataframe as dd
df = pd.read_csv('2015-01-01.csv') df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean() df.groupby(df.user_id).value.mean().compute()
Upvotes: 5
Views: 4260
Reputation: 57271
Yes, your intution is correct here. Most Dask collections (array, bag, dataframe, delayed) are lazy by default. Normal operations are lazy while calling compute actually triggers execution.
This is important both so that we can do bit of light optimization, and also so that we can support low-memory execution. For example, if you were to call
x = da.ones(1000000000000)
total = x.sum()
And if we ran immediately then there would be a time where we thought that you wanted the full array computed, which would be unfortunate if you were on a single machine. But if we know that you only want total.compute()
then we can compute this thing in much smaller memory.
Upvotes: 2