yh18190

Reputation: 419

Installing external libraries on worker nodes in Pyspark-Cluster mode

I am working with PySpark for NLP processing and similar tasks, and I am using the TextBlob Python library.

Normally, in standalone mode, it is easy to install external Python libraries. In cluster mode, however, I am having trouble installing these libraries on the worker nodes remotely, because I cannot access each worker machine to install them in its Python path.

I tried to use the SparkContext pyFiles option to ship .zip files... but the problem is that these Python packages need to be installed on the worker machines.

Is there a different way of doing this so that this library (TextBlob) becomes available in the Python path?

Upvotes: 2

Views: 3112

Answers (1)

Shawn Guo

Reputation: 3228

"I tried to use the SparkContext pyFiles option to ship .zip files... but the problem is that these Python packages need to be installed on the worker machines."

I guess you are using the default URL scheme (local:). local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO is incurred, and it works well for large files/JARs that are pushed to each worker or shared via NFS, GlusterFS, etc.

Another URL scheme is file:. With it, every executor pulls the file from the driver's HTTP server automatically, so you don't need to install the packages on the worker machines. file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver HTTP server.
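
As a minimal sketch of that approach, assuming TextBlob and its pure-Python dependencies have been bundled into a single zip on the driver machine (the path and application name below are placeholders):

    from pyspark import SparkConf, SparkContext

    # Placeholder path to a zip built on the driver, e.g. by zipping the
    # textblob package directory together with its pure-Python dependencies.
    deps_zip = "/home/user/deps/textblob_deps.zip"

    conf = SparkConf().setAppName("nlp-job")

    # pyFiles entries are shipped from the driver to every executor and added
    # to the workers' Python path, so nothing has to be pre-installed there.
    sc = SparkContext(conf=conf, pyFiles=[deps_zip])

    # The same zip can also be shipped after the context exists:
    # sc.addPyFile(deps_zip)
    # or at submission time:
    # spark-submit --py-files /home/user/deps/textblob_deps.zip job.py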

Please refer to Submitting Applications - Advanced Dependency Management.
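
To check that the package really resolves on the executors rather than only on the driver, something along these lines can be run with the same SparkContext (the function name is just illustrative):

    # Force the import to happen inside the executor processes.
    def uses_textblob(_):
        import textblob  # should resolve from the shipped zip on each worker
        return [textblob.__file__]

    # Two partitions so the import is attempted on executors, not the driver.
    print(sc.parallelize(range(2), 2).mapPartitions(uses_textblob).collect())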

Upvotes: 1
