Reputation: 359
I am using Dask DataFrame to parallelize the following regex-search code.
import re
import dask.dataframe as dd

ddf = dd.from_pandas(in_desc, npartitions=16)

def r_s(dataframe1):
    for vals in dataframe1:
        for regex in dataframe.values:  # dataframe holds the regex patterns
            if re.search(regex[0], vals):
                pass
    return dataframe1

res = ddf.map_partitions(r_s, meta=ddf)
res.compute()
in_desc and dataframe are both pandas DataFrames; dataframe1 receives one partition of ddf inside r_s.
On checking core utilization with mpstat -P ALL 1, I noticed that no single core out of the 16 went above 20%, although the summed utilization across all cores was roughly 100%.
Is it possible to push all the cores above 50% utilization with Dask? If so, how should I modify my code to achieve that?
Thanks.
Upvotes: 1
Views: 1851
Reputation: 57281
The default scheduler for dask dataframe uses multiple threads. This is the right choice for most pandas computations, especially vectorized numeric operations, but not for all of them.
Your computation, however, is mostly pure Python code, so it is limited by the GIL. I recommend that you use the multiprocessing scheduler instead:
res.compute(scheduler='processes')
Upvotes: 2