B.Mr.W.

Reputation: 19638

Process lots of images using AWS

I have a lot of images (100K+) stored in S3, and I have some pySpark code to process some of them. I am using Anaconda Python, so tons of libraries are already properly installed; I am using scipy and PIL for image processing.

I am planning to use EMR. The core question is how to properly install all the libraries on the cluster without too much hassle. Here are my options:

  1. Ship the dependencies along with the job (see the spark-submit sketch after this list). The Spark documentation says:

For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files. - [Spark Documentation]

  2. EMR also supports custom bootstrap actions to install software while provisioning the cluster. However, it turns out the Linux installation of Anaconda is not as simple as 'yum install -y'. The installation involves:

    • download anacondaxxx.sh
    • bash anacondaxxx.sh
    • answer 4 or 5 questions interactively
    • ..

Can anyone point me in the right direction: what is a better way to bring up a cluster with Spark and Anaconda Python (or at least scipy and PIL) installed?

Upvotes: 0

Views: 133

Answers (1)

jarmod

Reputation: 78803

Can you use EMR bootstrap actions to do a silent install of Anaconda?
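
A minimal sketch of such a bootstrap script, assuming the installer URL and install prefix below (both are placeholders for whatever version and path you actually want):

    #!/bin/bash
    # EMR bootstrap action: batch-mode (silent) Anaconda install on each node.
    set -e
    wget -O /tmp/anaconda.sh https://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh
    # -b runs the installer in batch mode: it accepts the license and skips
    # the interactive questions; -p sets the install prefix.
    bash /tmp/anaconda.sh -b -p /home/hadoop/anaconda
    # Point pySpark at the Anaconda interpreter (scipy and PIL/Pillow ship with it).
    echo 'export PYSPARK_PYTHON=/home/hadoop/anaconda/bin/python' >> /home/hadoop/.bashrc

You would upload the script to S3 and reference it when creating the cluster, e.g. aws emr create-cluster ... --bootstrap-actions Path=s3://my-bucket/install-anaconda.sh.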

You might also want to consider Lambda, as it now supports Python (2.7). Given that the files are already in S3, you'd need to script the Lambda invocations for them yourself.
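
A rough sketch of that scripting with the AWS CLI, assuming a Lambda function named process-image already exists (the function name, bucket, and prefix are all placeholders). The manual loop is needed because S3 event notifications only fire for newly uploaded objects, not existing ones:

    # Asynchronously invoke the (hypothetical) process-image function once per
    # existing object under the prefix. On AWS CLI v2 you may also need
    # --cli-binary-format raw-in-base64-out for the inline JSON payload.
    aws s3 ls s3://my-bucket/images/ --recursive | awk '{print $4}' |
    while read -r key; do
        aws lambda invoke --function-name process-image \
            --invocation-type Event \
            --payload "{\"bucket\": \"my-bucket\", \"key\": \"$key\"}" \
            /dev/null
    done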

Upvotes: 1
