Giannis

Reputation: 5526

Dask performance on distributed workers

I am trying to decide if Spark or Dask gives us better performance for the work we are doing. I have a simple script that runs some operations on a DataFrame.

I am not convinced I am using the distributed version correctly, as the times are much slower than using Dask locally. Here is my script:

    import time
    from datetime import datetime

    import pandas as pd
    import dask.dataframe as dd
    from dask.distributed import Client


    def CreateTransactionFile(inputFile, client):
        startTime = time.time()
        df = dd.read_csv(inputFile)

        # one-row DataFrame passed as `meta` to describe the output columns/dtypes
        mock = pd.DataFrame([[1, 1, datetime.today(), 1, 1]],
                            columns=['A', 'B', 'C', 'D', 'E'])

        outDf = df.map_partitions(CreateTransactionFile_Partition, meta=mock)
        outDf.compute()
        print(str(time.time() - startTime))


    if __name__ == '__main__':
        client = Client('10.184.62.61:8786')
        hdfs = 'hdfs://dir/python/test/bigger.csv'
        CreateTransactionFile(hdfs, client)

CreateTransactionFile_Partition operates on the provided dataframe using Pandas and NumPy, and returns a dataframe as a result.
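
For illustration only, here is a minimal sketch of its shape (this is not my real code, and the column names are placeholders): map_partitions hands it each partition as a plain Pandas DataFrame and expects a Pandas DataFrame back that matches meta.

    import numpy as np
    import pandas as pd
    from datetime import datetime

    # Illustrative stand-in for the real partition function: it receives one
    # partition as a plain pandas DataFrame and must return a pandas DataFrame
    # matching the `meta` schema. Columns 'A'..'E' are placeholders.
    def CreateTransactionFile_Partition(part):
        out = part.copy()
        out['C'] = datetime.today()               # e.g. stamp the rows
        out['E'] = np.where(out['D'] > 0, 1, 0)   # e.g. a NumPy-derived flag
        return out[['A', 'B', 'C', 'D', 'E']]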

Should I be using something other than compute? The above code takes more than twice as long on a 700M-row CSV (~30 GB) as it does on a local machine (550s distributed vs. 230s local). The local test reads a local file, while the multi-worker test reads from HDFS.

Upvotes: 1

Views: 281

Answers (1)

mdurant

Reputation: 28673

    outDf.compute()

What's happening here: the workers are loading and processing the partitions of the data, and then the results are being copied to the client and assembled into a single, in-memory dataframe. This copying requires potentially expensive inter-process communication. That might be what you want if the processing is aggregational and the output is small.
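
For example, if the per-partition work ends in a reduction, the result is small and pulling it back with .compute() is cheap. A minimal sketch, with placeholder column names:

    import dask.dataframe as dd

    df = dd.read_csv('hdfs://dir/python/test/bigger.csv')

    # The groupby/sum shrinks the data on the workers; only the small
    # aggregated result is copied back to the client as a pandas Series.
    totals = df.groupby('A')['B'].sum().compute()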

However, if the output is large, you want to do your processing on the workers by using the dataframe API without .compute(), perhaps writing the output to files with, e.g., .to_parquet().
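
A minimal sketch of that pattern, assuming outDf is the mapped dataframe from your question and that the workers can write back to HDFS (the output path is made up, and Parquet output needs pyarrow or fastparquet installed on the workers):

    # Each worker writes its own partitions directly; nothing large is ever
    # copied back to the client. Replace the path with wherever the output
    # should live.
    outDf.to_parquet('hdfs://dir/python/test/transactions_parquet/')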

Upvotes: 2
