Reputation: 33
I have around 1000 csv files, each more than 400 MB and containing more than 2 million rows. I would like to process each of the files for data analysis, which includes a lot of filtering and using loops. I made a dummy code to test this and it is taking around 4-5 minutes per file. Is there any way to make this process faster? Can Dask reduce this processing time?
Upvotes: 1
Views: 385
Reputation: 8088
In general, your task looks like a good fit for massively parallel processing. Usually you would need to rewrite your code in a functional style so that an external framework can schedule the work across multiple servers. Fortunately, there is an easier way that can help you here. You just need to wrap the code that processes a single file in a form like this:
result_dataframe = processing(input_dataframe)
assuming that input_dataframe is your file loaded as a pandas DataFrame.
Then you can use Dask's map_partitions function (https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html) to run it in parallel over your files.
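A minimal sketch of that approach, assuming the CSV files share a schema and live under a hypothetical data/ directory (processing stands in for your own per-file pandas code):

import dask.dataframe as dd

def processing(input_dataframe):
    # your existing pandas logic: filtering, loops, etc.
    return input_dataframe

# lazily read all CSV files; blocksize=None keeps one partition per file
ddf = dd.read_csv("data/*.csv", blocksize=None)

# run the pandas function on every partition (file) in parallel
result = ddf.map_partitions(processing)

result.compute()  # or result.to_csv("results/*.csv") to write one output file per partition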
Upvotes: 1
Reputation: 28683
Short answer: yes, it sounds like your processing is embarrassingly parallel, so Dask (or any other method) would do the job for you. Assuming func
is what you want to run on each filename:
from dask import delayed, compute
from dask.distributed import Client

client = Client()  # start a local cluster

tasks = [delayed(func)(fn) for fn in all_my_filenames]
results = compute(*tasks)
(this uses the distributed scheduler, which by default will use as many CPU cores as you have available)
Note that you might want to run with fewer than the default number of workers if your calculation is memory intensive. Also note that this will do little to improve the IO performance of loading or writing bytes, because your disk is only so fast and all the cores will be trying to use it at once.
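For example, if memory is a concern, you can start the local cluster with an explicit, smaller worker count (the numbers here are only illustrative):

client = Client(n_workers=4, threads_per_worker=1)  # cap at 4 worker processes instead of one per core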
To answer a point in the comments:
Have you looked into Python's multiprocessing and/or multithreading? Dask has quite a bit of overhead and may not provide better performance than native MP/MT unless you're scaling up with multiple machines.
This is true. For a situation like this you could do the processing by hand with multiprocessing and/or threading, but Dask's API is probably easier if you've never done that, the overhead is essentially the same, and at 4 minutes per task it is negligible.
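For comparison, a bare-bones multiprocessing version of the same pattern could look like this (func and all_my_filenames are the same placeholders as above):

from multiprocessing import Pool

if __name__ == "__main__":
    # one worker process per CPU core by default
    with Pool() as pool:
        results = pool.map(func, all_my_filenames)

The trade-off is mostly ergonomics: Pool.map gives you parallel execution on one machine, but none of Dask's lazy task graphs, dashboard, or easy scaling out to a cluster.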
Upvotes: 0