KevinKim

Reputation: 1434

How to install packages on EMR

I created a cluster on AWS with Jupyter and Python 3 installed. I can run code in the cells, and numpy is available: after import numpy as np I can use its functions. pandas, however, is missing. So in the next cell I ran !pip install pandas, which displayed:

Requirement already satisfied: pandas in /mnt/usrmoved/local/lib64/python2.7/site-packages
Requirement already satisfied: pytz>=2011k in /mnt/usrmoved/local/lib/python2.7/site-packages (from pandas)
Requirement already satisfied: numpy>=1.7.0 in /mnt/usrmoved/local/lib64/python2.7/site-packages (from pandas)
Requirement already satisfied: python-dateutil in /mnt/usrmoved/local/lib/python2.7/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /mnt/usrmoved/local/lib/python2.7/site-packages (from python-dateutil->pandas)

I assumed it had installed successfully, but when I ran import pandas as pd in the next cell, I got an error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-8-af55e7023913> in <module>()
----> 1 import pandas as pd

ImportError: No module named 'pandas'

In general, how should Python packages be installed on EMR?

On my laptop, running "! pip install package" in Jupyter always works. Why does it not work in Jupyter on EMR?

Upvotes: 3

Views: 13253

Answers (2)

Daniel R Carletti

Reputation: 549

I tried installing Python packages with pip install, but got pip: command not found. Using pip3 instead of pip worked.

Using EMR 5.30.1
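The pip output in the question points at a python2.7 site-packages directory while the notebook kernel runs Python 3, which would explain why switching to pip3 helps. One way to avoid that mismatch from inside a notebook is to run pip through the kernel's own interpreter, using sys.executable. The following is a minimal sketch under that assumption; the helper names pip_install_command and install are hypothetical, and it only affects the node the kernel runs on, not the rest of the cluster.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this kernel."""
    # sys.executable is the path of the Python that is executing this code,
    # so "-m pip" installs into that interpreter's site-packages.
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Run pip via the current interpreter; raises CalledProcessError on failure."""
    subprocess.check_call(pip_install_command(package))


# In a notebook cell one could then call, for example:
# install("pandas")
```

Printing sys.executable in a cell is also a quick way to see which Python the kernel is actually using.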

Upvotes: 4

Justin C.

Reputation: 382

The conventional way to install Python packages on EMR is to specify them at cluster creation using a bootstrap action.

This method ensures the packages are installed on all nodes and not just the driver.

aws emr create-cluster \
--name 'test python packages' \
--release-label emr-5.20.0 \
--region us-east-1 \
--use-default-roles \
--instance-type m4.large \
--instance-count 2 \
--bootstrap-actions \
    Path="s3://your-bucket/python-modules.sh",Name='Install Python Modules'

The python-modules.sh would contain commands to install the python packages. For example:

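#!/bin/sh

# Install needed packages on every node.
# pip targets the default Python; for a Python 3 kernel, pip3 may be needed instead.
sudo pip install pandas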
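To check that a bootstrap-installed package is visible to the interpreter behind the notebook kernel, something like the following could be run in a cell. It is only a sketch, not part of any AWS tooling, and the helper name is_importable is hypothetical.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if a top-level module can be found by this interpreter."""
    # find_spec searches the current interpreter's import path without
    # actually importing the module.
    return importlib.util.find_spec(module_name) is not None


# Example check from a notebook cell:
# print(sys.executable, is_importable("pandas"))
```

If this reports False for a package the bootstrap script installed, the script most likely installed it for a different Python version than the one the kernel uses.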

AWS documentation

Upvotes: 1
