sweeeeeet

Reputation: 1819

Multicore Python as an alternative to Spark

I have a Python program that does a lot of pandas and scikit-learn computation. It basically iterates over a dataframe and performs calculations on each part. The code uses the map function of the multiprocessing module, and it also uses some sklearn models with n_jobs = -1.
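Roughly, the structure looks like this (the model, column names, and chunking are just illustrative):

    import numpy as np
    import pandas as pd
    from multiprocessing import Pool
    from sklearn.ensemble import RandomForestRegressor

    def process_chunk(chunk):
        # Per-chunk computation; each worker also parallelises
        # internally through the model's n_jobs=-1.
        X, y = chunk.drop(columns="y"), chunk["y"]
        model = RandomForestRegressor(n_estimators=50, n_jobs=-1)
        model.fit(X, y)
        return model.score(X, y)

    if __name__ == "__main__":
        df = pd.DataFrame(np.random.rand(10_000, 5),
                          columns=["a", "b", "c", "d", "y"])
        # Split the dataframe into one chunk per worker process.
        size = len(df) // 8
        chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
        with Pool(processes=8) as pool:
            scores = pool.map(process_chunk, chunks)
        print(scores)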

It needs 1 TB of RAM and 100 cores to run. Sadly, the biggest machine I can launch at cloud providers has about 16 cores and 100 GB of RAM.

Is there a simple way to adapt my Python script to run on a cluster of machines, or something similar, in order to handle the computation?

I don't want to rewrite everything in Spark if I don't have to.

Upvotes: 1

Views: 1209

Answers (2)

jharb

Reputation: 159

A bit late to the party, but for the people who stumble upon this question, you could also try Dask.

This page from the documentation describes how it compares to Spark, and the summary answers the question:

Generally Dask is smaller and lighter weight than Spark. This means that it has fewer features and, instead, is used in conjunction with other libraries, particularly those in the numeric Python ecosystem. It couples with libraries like Pandas or Scikit-Learn to achieve high-level functionality.
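As a minimal sketch (the scheduler address, file pattern, and column names are illustrative), you can swap pandas for the pandas-like dask.dataframe API:

    import dask.dataframe as dd
    from dask.distributed import Client

    if __name__ == "__main__":
        # With no address, Client() starts a local cluster; point it at
        # "scheduler-host:8786" to fan out across real machines.
        client = Client()

        # Lazily read the data as a collection of pandas partitions.
        ddf = dd.read_csv("data-*.csv")

        # Pandas-style operations run in parallel, one task per partition.
        result = ddf.groupby("key")["value"].mean().compute()
        print(result)

Because dask.distributed also registers a joblib backend, existing scikit-learn code that relies on n_jobs can often be pointed at the same cluster with joblib.parallel_backend("dask").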

Upvotes: 0

noxdafox

Reputation: 15060

You can take a look at Celery.

The project focuses on solving exactly this kind of problem.

The execution units, called tasks, are executed concurrently on a single or more worker servers...
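A minimal sketch, assuming a Redis broker on localhost (the broker URL, module name, and task body are illustrative):

    # tasks.py
    import pandas as pd
    from celery import Celery

    app = Celery("tasks",
                 broker="redis://localhost:6379/0",
                 backend="redis://localhost:6379/0")

    @app.task
    def process_chunk(records):
        # Runs on whichever worker server picks the task up.
        df = pd.DataFrame(records)
        return float(df["value"].sum())

Start a worker on each machine with celery -A tasks worker, then dispatch from the driver:

    import pandas as pd
    from tasks import process_chunk

    df = pd.DataFrame({"value": [1.0, 2.0, 3.0]})
    # Ship plain records (JSON-serialisable) and rebuild the frame on the worker.
    async_result = process_chunk.delay(df.to_dict("records"))
    print(async_result.get(timeout=60))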

Upvotes: 1
