Georg Heiler

Reputation: 17724

Move from pandas to dask to utilize all local CPU cores

Recently I stumbled upon http://dask.pydata.org/en/latest/. I have some pandas code which only runs on a single core, and I wonder how to make use of my other CPU cores. Would dask work well to use all (local) CPU cores? If yes, how compatible is it to pandas?

Could I use multiple CPUs with pandas? So far I have read about releasing the GIL, but that all seems rather complicated.

Upvotes: 6

Views: 3981

Answers (2)

mdurant

Reputation: 28684

Dask implements a large fraction of the pandas API in its dataframes. These operations call the very same pandas function on chunks of your overall dataframe, so you should expect them to be totally compatible.

The resulting computations can be run on any of the available schedulers, so you can choose between low-overhead threads and something more complex. The distributed scheduler gives you full control over the split between threads and processes, has more features, and can later be scaled out across a cluster, so it is increasingly the preferred option, even for simple single-machine tasks.

Many pandas operations do release the GIL and so will work efficiently with threads. Also, many pandas operations can be easily broken down into parallel chunks - but some cannot and will either be slower (such as joins requiring shuffles), or not work at all (such as multi-indexing). The best way to find out is to give it a try!

Upvotes: 4

John Zwinck

Reputation: 249642

Would dask work well to use all (local) CPU cores?

Yes.

how compatible is it to pandas?

Pretty compatible. Not 100%. You can mix in Pandas and NumPy and even pure Python stuff with Dask if needed.

Could I use multiple CPUs with pandas?

You could. The easiest way would be to use multiprocessing and keep your data separate: have each job independently read from disk and write to disk, if you can do so efficiently. A significantly harder way is using mpi4py, which is most useful if you have a multi-computer environment with a professional administrator.
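A minimal sketch of the multiprocessing approach, assuming the data is already split into several CSV files. The file names, column names, and the helpers `summarize_file` and `run_jobs` are hypothetical and only serve the example.

```python
import multiprocessing as mp
import tempfile
from pathlib import Path

import pandas as pd


def summarize_file(paths):
    """Worker: read one CSV, aggregate it with pandas, and write the result to disk."""
    in_path, out_path = paths
    df = pd.read_csv(in_path)
    df.groupby("key", as_index=False)["value"].sum().to_csv(out_path, index=False)
    return out_path


def run_jobs(jobs, processes=None):
    """Run one worker call per (input, output) pair; defaults to one process per core."""
    with mp.Pool(processes=processes) as pool:
        return pool.map(summarize_file, jobs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        jobs = []
        for i in range(4):
            # Create a small input file per job (stand-in for real data on disk).
            in_path = tmp / f"part_{i}.csv"
            pd.DataFrame({"key": [0, 1, 0], "value": [i, i + 1, i + 2]}).to_csv(
                in_path, index=False
            )
            jobs.append((str(in_path), str(tmp / f"summary_{i}.csv")))

        for out_path in run_jobs(jobs):
            print(out_path, pd.read_csv(out_path).to_dict("records"))
```

Only file paths travel between processes here, so no dataframes need to be pickled and sent back to the parent.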

Upvotes: 5
