Reputation: 1434
I created a cluster on AWS with Jupyter and Python 3 installed. I can run code in the cells, and numpy is installed: after import numpy as np I can use the functions in that package. However, pandas is missing, so in the next cell I typed !pip install pandas, which displays:
Requirement already satisfied: pandas in /mnt/usrmoved/local/lib64/python2.7/site-packages
Requirement already satisfied: pytz>=2011k in /mnt/usrmoved/local/lib/python2.7/site-packages (from pandas)
Requirement already satisfied: numpy>=1.7.0 in /mnt/usrmoved/local/lib64/python2.7/site-packages (from pandas)
Requirement already satisfied: python-dateutil in /mnt/usrmoved/local/lib/python2.7/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /mnt/usrmoved/local/lib/python2.7/site-packages (from python-dateutil->pandas)
I thought it had installed successfully, but when I run import pandas as pd in the next cell, I get an error:
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-8-af55e7023913> in <module>()
----> 1 import pandas as pd

ImportError: No module named 'pandas'
In general, how should Python packages be installed on EMR?
On my laptop, I always run "!pip install package" in Jupyter and it works. Why doesn't it work in Jupyter on EMR?
Upvotes: 3
Views: 13253
Reputation: 549
I tried installing Python packages with pip install, but got pip: command not found. Using pip3 instead of pip worked.
Using EMR 5.30.1
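The pip versus pip3 difference comes down to which interpreter a given pip belongs to. In the question, pip reports a python2.7 site-packages path while the notebook kernel is Python 3. A minimal sketch of one way to sidestep that from inside a notebook is to run pip through the kernel's own interpreter. The helper names below are hypothetical, and on EMR the install may additionally need sudo or --user depending on permissions:

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this kernel."""
    # sys.executable is the Python binary behind the current kernel, so
    # "-m pip" installs into the site-packages that this kernel imports from.
    return [sys.executable, "-m", "pip", "install", package]


def install_for_this_kernel(package):
    """Run the install; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_command(package))


# Show which interpreter and version the kernel is actually using.
print(sys.executable)
print(sys.version_info[:2])

# Uncomment to perform the install:
# install_for_this_kernel("pandas")
```

This only affects the node where the notebook runs, so code that executes on the worker nodes would still need the package installed there.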
Upvotes: 4
Reputation: 382
The conventional way to install Python packages on EMR is to specify them at cluster creation with a bootstrap action.
This ensures the packages are installed on every node, not just the driver.
aws emr create-cluster \
  --name 'test python packages' \
  --release-label emr-5.20.0 \
  --region us-east-1 \
  --use-default-roles \
  --instance-type m4.large \
  --instance-count 2 \
  --bootstrap-actions \
    Path="s3://your-bucket/python-modules.sh",Name='Install Python Modules'
The python-modules.sh script would contain the commands that install the Python packages. For example:
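#!/bin/sh
# Install needed packages (runs on every node at cluster startup).
# If the notebook kernel runs Python 3, pip3 may be needed instead of pip.
sudo pip install pandas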
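For anyone creating the cluster from Python rather than the CLI, the same request can be sketched as the keyword arguments for boto3's run_job_flow. This is an illustration under assumptions: build_cluster_request is a hypothetical helper, the role names are the ones --use-default-roles normally maps to, and the bucket path is the placeholder from the command above. The actual API call is left commented out because it needs AWS credentials:

```python
def build_cluster_request(script_path):
    """Build run_job_flow kwargs mirroring the create-cluster command above."""
    return {
        "Name": "test python packages",
        "ReleaseLabel": "emr-5.20.0",
        "Instances": {
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 2,
            # Keep the cluster running after startup so notebooks can use it.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "BootstrapActions": [
            {
                "Name": "Install Python Modules",
                # The script is fetched from S3 and run on every node.
                "ScriptBootstrapAction": {"Path": script_path, "Args": []},
            }
        ],
        # Assumed equivalents of --use-default-roles.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request("s3://your-bucket/python-modules.sh")
print(request["BootstrapActions"][0]["ScriptBootstrapAction"]["Path"])

# To submit (requires boto3 and AWS credentials):
# import boto3
# boto3.client("emr", region_name="us-east-1").run_job_flow(**request)
```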
Upvotes: 1