Reputation: 63707
I am running Dask on a single machine, where calling .compute() to run a computation over a huge parquet file causes Dask to use all of the CPU cores on the system.
import dask.dataframe as dd
df = dd.read_parquet(parquet_file) # very large file
print(df.names.unique().compute())
Is it possible to configure Dask to use a specific number of CPU cores and to limit its memory usage to, say, 32 GB? I am using Python 3.7.2 and Dask 2.9.2.
Upvotes: 6
Views: 2928
Reputation: 4004
dask.distributed.Client creates a LocalCluster, and you can explicitly set its memory limit and the number of workers and threads it uses.
import numpy as np
import pandas as pd
from dask.distributed import Client
from dask import dataframe as dd
def names_unique(x):
    return x['Names'].unique()

# Local cluster limited to 1 worker with 2 threads and a 2 GB memory budget
client = Client(memory_limit='2GB', processes=False,
                n_workers=1, threads_per_worker=2)

# Data generation
df = pd.DataFrame({'Names': np.random.choice(['A', 'B', 'C', 'D'], size=1000000),
                   'sales': np.arange(1000000)})
df.to_parquet('parq_df')
ddf = dd.read_parquet('parq_df')  # partitioning comes from the parquet file itself
# Custom computation
sent = client.submit(names_unique, ddf)
names_unique = sent.result().compute()
client.close()
Output:
names_unique
Out[89]:
0 D
1 B
2 C
3 A
Name: Names, dtype: object
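As a side note, once a distributed Client has been created it becomes the default scheduler for the session, so the lazy expression from the question can also be computed directly, without going through client.submit. A minimal sketch, assuming the same 'parq_df' file and 'Names' column as above (the 32 GB / 4-thread limits are only illustrative values):
from dask.distributed import Client
from dask import dataframe as dd
# Local cluster capped at 4 threads and ~32 GB of memory
# (memory_limit applies per worker, so with n_workers=1 it is also the total budget)
client = Client(processes=False, n_workers=1,
                threads_per_worker=4, memory_limit='32GB')
ddf = dd.read_parquet('parq_df')
unique_names = ddf['Names'].unique().compute()  # runs on the resource-limited cluster
print(unique_names)
client.close()
Note that memory_limit is enforced per worker by the worker's memory manager (spilling and pausing), not as a hard OS-level cap, and with several workers the total budget is n_workers * memory_limit.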
Upvotes: 3