JuanPabloMF

Reputation: 497

Using all cores in Dask

I am working on a Google Cloud compute instance with 24 vCPUs. The code I am running is the following:

import dask.dataframe as dd
from distributed import Client
client = Client()

#read data
logd = (dd.read_csv('vol/800000test', sep='\t', parse_dates=['Date'])
         .set_index('idHttp')
         .rename(columns={'User Agent Type':'UA'})
         .categorize())

When I run it (and the same happens for the subsequent data analysis I do after loading the data), I see 11 cores being used, sometimes only 4.


Is there any way to control this better and make full use of cores?

Upvotes: 5

Views: 1532

Answers (1)

mdurant

Reputation: 28683

read_csv will split your files into chunks according to the blocksize parameter, with at least one chunk per input file. You are reading only one file, and it seems you are getting four partitions (i.e., the file size is < 4 * 64MB, the default blocksize). This may be reasonable for the amount of data; extra parallelization into many small tasks might only add overhead.
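
One quick way to confirm this (not from the original post, just a standard check on a Dask dataframe) is to look at the partition count:

# how many partitions Dask actually created from the CSV
print(logd.npartitions)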

Nevertheless, you could change the blocksize parameter and see what difference it makes for you, or see what happens when you pass multiple files, e.g., read_csv('vol/*test'). Alternatively, you can set the number of partitions in the call to set_index.
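
For example, a sketch of those three options (the 16MB blocksize, the glob pattern, and npartitions=24 are illustrative values, not recommendations):

# 1. smaller blocksize -> more partitions from a single file
logd = (dd.read_csv('vol/800000test', sep='\t', parse_dates=['Date'],
                    blocksize='16MB')   # roughly 16MB per partition
         .set_index('idHttp')
         .rename(columns={'User Agent Type': 'UA'})
         .categorize())

# 2. read several files at once; each file gives at least one partition
logd = dd.read_csv('vol/*test', sep='\t', parse_dates=['Date'])

# 3. choose the partition count explicitly when setting the index
logd = logd.set_index('idHttp', npartitions=24)

With more partitions than workers, the scheduler has enough tasks to keep all 24 vCPUs busy, provided each task is not so small that scheduling overhead dominates.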

Upvotes: 1
