Reputation: 1819
I have a Python program that does a lot of pandas and sklearn computation. It basically iterates over a dataframe and performs calculations on it. The code uses the map function of the multiprocessing module. It also uses some sklearn models with n_jobs = -1.
It needs 1 TB of RAM and 100 cores to run. Sadly, the biggest machine I can launch at cloud providers has only about 16 cores and 100 GB of RAM.
Is there a simple way to adapt my Python script to run on a cluster of machines, or something similar, in order to handle the computation?
I don't want to rewrite everything in Spark if I don't have to.
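For context, the setup described presumably looks something like the sketch below (process_chunk, data.csv, group_id and target are illustrative names, not taken from the actual code): the dataframe is split into chunks, a process pool maps a heavy function over them, and each chunk fits an sklearn model that itself uses all local cores.

    import pandas as pd
    from multiprocessing import Pool
    from sklearn.ensemble import RandomForestRegressor

    def process_chunk(chunk):
        # Heavy per-chunk work: fit an sklearn model that parallelizes
        # over all local cores with n_jobs=-1.
        X, y = chunk.drop(columns=["target"]), chunk["target"]
        model = RandomForestRegressor(n_jobs=-1)
        model.fit(X, y)
        return model.predict(X)

    if __name__ == "__main__":
        df = pd.read_csv("data.csv")                      # whole frame in local memory
        chunks = [g for _, g in df.groupby("group_id")]   # split the dataframe
        with Pool() as pool:                              # one worker per local core
            results = pool.map(process_chunk, chunks)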
Upvotes: 1
Views: 1209
Reputation: 159
A bit late to the party, but for anyone who stumbles upon this question, you could also try Dask.
This page from the documentation describes how it compares to Spark, and the summary answers the question:
Generally Dask is smaller and lighter weight than Spark. This means that it has fewer features and, instead, is used in conjunction with other libraries, particularly those in the numeric Python ecosystem. It couples with libraries like Pandas or Scikit-Learn to achieve high-level functionality.
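As a rough illustration of what that coupling looks like in practice, here is a minimal sketch, assuming a Dask scheduler and workers are already running on the cluster (the scheduler address, the CSV file pattern and process_chunk are illustrative, not from the question):

    import joblib
    import numpy as np
    import dask.dataframe as dd
    from dask.distributed import Client
    from sklearn.ensemble import RandomForestRegressor

    client = Client("tcp://scheduler-address:8786")   # address of the cluster scheduler

    def process_chunk(chunk):
        # Same kind of per-chunk pandas work the multiprocessing version did;
        # the body here is just a placeholder calculation.
        return chunk.assign(score=chunk["value"] * 2)

    # The dataframe lives in the workers' combined memory; map_partitions plays
    # the role of multiprocessing.Pool.map, and persist() keeps the result on
    # the cluster so only reductions are pulled back locally.
    ddf = dd.read_csv("data-*.csv")
    ddf = ddf.map_partitions(process_chunk).persist()
    print(ddf["score"].mean().compute())

    # sklearn's n_jobs parallelism can also be dispatched to the same workers
    # through joblib's dask backend instead of local processes.
    X_train, y_train = np.random.rand(1000, 5), np.random.rand(1000)
    model = RandomForestRegressor(n_jobs=-1)
    with joblib.parallel_backend("dask"):
        model.fit(X_train, y_train)

The operational requirement is just that a scheduler and some workers are running on the cluster machines (they ship with dask.distributed as the dask-scheduler and dask-worker commands), so the existing pandas/sklearn code mostly stays as it is.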
Upvotes: 0