Giannis

Reputation: 5526

Dask performance on distributed workers

I am trying to decide if Spark or Dask gives us better performance for the work we are doing. I have a simple script that runs some operations on a DataFrame.

I am not convinced I am using the distributed version correctly, as the times are much slower than using Dask locally. Here is my script:

    import time
    from datetime import datetime

    import pandas as pd
    import dask.dataframe as dd
    from dask.distributed import Client


    def CreateTransactionFile(inputFile, client):
        startTime = time.time()
        df = dd.read_csv(inputFile)

        # one-row DataFrame passed as `meta` to describe the output columns/dtypes
        mock = pd.DataFrame([[1, 1, datetime.today(), 1, 1]],
                            columns=['A', 'B', 'C', 'D', 'E'])

        outDf = df.map_partitions(CreateTransactionFile_Partition, meta=mock)
        outDf.compute()
        print(str(time.time() - startTime))


    if __name__ == '__main__':
        client = Client('10.184.62.61:8786')
        hdfs = 'hdfs://dir/python/test/bigger.csv'
        CreateTransactionFile(hdfs, client)

CreateTransactionFile_Partition operates on the provided dataframe using Pandas and NumPy, and returns a dataframe as a result.
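
For illustration only, here is a minimal sketch of its shape (this is not my real code, and the column names are placeholders): map_partitions hands it each partition as a plain Pandas DataFrame and expects a Pandas DataFrame back that matches meta.

    import numpy as np
    import pandas as pd
    from datetime import datetime

    # Illustrative stand-in for the real partition function: it receives one
    # partition as a plain pandas DataFrame and must return a pandas DataFrame
    # matching the `meta` schema. Columns 'A'..'E' are placeholders.
    def CreateTransactionFile_Partition(part):
        out = part.copy()
        out['C'] = datetime.today()               # e.g. stamp the rows
        out['E'] = np.where(out['D'] > 0, 1, 0)   # e.g. a NumPy-derived flag
        return out[['A', 'B', 'C', 'D', 'E']]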

Should I be using something other than compute? The above code takes more than twice as long on a 700M-row CSV (~30 GB) as it does on a local machine (550s distributed vs. 230s local). The local test reads a local file, while the multi-worker test reads from HDFS.

Upvotes: 1

Views: 281

Answers (1)

mdurant

Reputation: 28673

    outDf.compute()

What's happening here: the workers are loading and processing the partitions of the data, and then the results are being copied to the client and assembled into a single, in-memory dataframe. This copying requires potentially expensive inter-process communication. That might be what you want if the processing is aggregational and the output is small.
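
For example, if the per-partition work ends in a reduction, the result is small and pulling it back with .compute() is cheap. A minimal sketch, with placeholder column names:

    import dask.dataframe as dd

    df = dd.read_csv('hdfs://dir/python/test/bigger.csv')

    # The groupby/sum shrinks the data on the workers; only the small
    # aggregated result is copied back to the client as a pandas Series.
    totals = df.groupby('A')['B'].sum().compute()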

However, if the output is large, you want to do your processing on the workers by using the dataframe API without .compute(), perhaps writing the output to files with, e.g., .to_parquet().
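
A minimal sketch of that pattern, assuming outDf is the mapped dataframe from your question and that the workers can write back to HDFS (the output path is made up, and Parquet output needs pyarrow or fastparquet installed on the workers):

    # Each worker writes its own partitions directly; nothing large is ever
    # copied back to the client. Replace the path with wherever the output
    # should live.
    outDf.to_parquet('hdfs://dir/python/test/transactions_parquet/')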

Upvotes: 2
