Reputation: 61
I am trying to run an Amazon EMR cluster (version emr-6.1.0) and want some Python packages to be preinstalled.
So, I used a bootstrap script:
#!/bin/bash
sudo pip3 install --user pyspark pandas xlrd==1.2.0
The EMR cluster starts up fine, but when I try to import any of the modules I installed, I get an import error:
Python 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.24.1 -- An enhanced Interactive Python. Type '?' for help.
>>> import xlrd
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-2743bb67f6dd> in <module>
----> 1 import xlrd
ModuleNotFoundError: No module named 'xlrd'
My first thought was that the packages were not being installed, but the EMR log file
stdout.gz (i.e., in the path: Amazon S3 /aws-logs-600286585385-us-east-1/elasticmapreduce/j-27GOG786YFR2SB/node/i-02fabe3g74jf9959a/bootstrap-actions/1/) says otherwise:
Collecting pyspark
Downloading https://files.pythonhosted.org/packages/89/db/e18cfd78e408de957821ec5ca56de1250645b05f8523d169803d8df35a64/pyspark-3.1.2.tar.gz (212.4MB)
Collecting pandas
Downloading https://files.pythonhosted.org/packages/99/f7/01cea7f6c963100f045876eb4aa1817069c5c9eca73d2dbfb5d31ff9a39f/pandas-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (10.8MB)
Collecting xlrd==1.2.0
Downloading https://files.pythonhosted.org/packages/b0/16/63576a1a001752e34bf8ea62e367997530dc553b689356b9879339cf45a4/xlrd-1.2.0-py2.py3-none-any.whl (103kB)
Collecting py4j==0.10.9 (from pyspark)
Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Collecting python-dateutil>=2.7.3 (from pandas)
Downloading https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227kB)
Collecting numpy>=1.17.3 (from pandas)
Downloading https://files.pythonhosted.org/packages/2c/d2/8973eb282fc3c7e6c4db0469f0390d81d8eb9ae56dfaa2a7e6db07283682/numpy-1.21.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (14.1MB)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas)
Installing collected packages: py4j, pyspark, python-dateutil, numpy, pandas, xlrd
Running setup.py install for pyspark: started
Running setup.py install for pyspark: finished with status 'done'
Successfully installed numpy-1.21.0 pandas-1.3.0 py4j-0.10.9 pyspark-3.1.2 python-dateutil-2.8.1 xlrd-1.2.0
Any ideas on what is going on, or how to solve the issue?
Upvotes: 0
Views: 1284
Reputation: 61
I solved my issue, and am posting what went wrong here.
After the EMR cluster is created, it first enters the "Starting" state. Even though you can connect a notebook to the cluster in this state, the bootstrap actions have not run yet. After some time, the cluster automatically enters the "Bootstrapping" state, in which the bootstrap commands are executed.
I made the mistake of trying to import the packages before the "Bootstrapping" state was over, which caused the import error.
For further info regarding the lifecycle of an EMR cluster, see the Amazon EMR documentation on the cluster lifecycle.
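If you would rather not watch the console, the wait can be scripted. Below is a minimal sketch, not the only way to do it: it polls the cluster state and returns once the cluster has left STARTING and BOOTSTRAPPING. The helper names (`wait_until_bootstrapped`, `emr_state_getter`), the polling interval, the timeout, and the default region are my own assumptions; the boto3 part assumes boto3 is installed and AWS credentials are configured.

```python
import time

# States in which bootstrap actions have not finished yet.
NOT_READY_STATES = {"STARTING", "BOOTSTRAPPING"}
# States from which the cluster will not become usable.
FAILED_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_bootstrapped(get_state, poll_seconds=30, timeout_seconds=1800,
                            sleep=time.sleep):
    """Poll get_state() until the cluster has left STARTING/BOOTSTRAPPING.

    get_state is any zero-argument callable returning the state string.
    (Hypothetical helper, written for illustration.)
    """
    waited = 0
    while True:
        state = get_state()
        if state in FAILED_STATES:
            raise RuntimeError(f"cluster ended in state {state}")
        if state not in NOT_READY_STATES:
            return state  # typically RUNNING or WAITING
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {state} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


def emr_state_getter(cluster_id, region_name="us-east-1"):
    """Build a get_state callable backed by boto3's describe_cluster."""
    import boto3  # imported lazily so the polling helper works without boto3

    emr = boto3.client("emr", region_name=region_name)
    return lambda: emr.describe_cluster(
        ClusterId=cluster_id)["Cluster"]["Status"]["State"]
```

Usage would look like `wait_until_bootstrapped(emr_state_getter("<your-cluster-id>"))`, run before attaching the notebook or importing anything. boto3 also ships a built-in `cluster_running` waiter for the EMR client, which may be simpler if you do not need custom handling.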
Upvotes: 2
Reputation: 1410
If pip3 doesn't work, try this instead:
sudo python3 -m pip install pandas xlrd==1.2.0
I faced a similar issue when I was working with emr-5.26.0, and this worked for me. I'm not sure what the difference is between pip3 install and python3 -m pip install, though.
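One difference I am fairly sure of: python3 -m pip runs pip under whichever python3 is first on the PATH, while the pip3 script can belong to a different Python installation. To check whether that matters on your cluster, here is a small diagnostic sketch to run inside the notebook or shell where the import fails. It uses only the standard library; the function names `describe_module` and `report` are made up for this example.

```python
import importlib.util
import site
import sys


def describe_module(name):
    """Return where the current interpreter would import `name` from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Namespace packages have no single origin file.
    return spec.origin or "<namespace package>"


def report(names):
    """Summarize the running interpreter and where each module resolves."""
    lines = [f"interpreter: {sys.executable}"]
    lines.append(f"user site:   {site.getusersitepackages()}")
    for name in names:
        location = describe_module(name)
        lines.append(f"{name}: {location if location else 'NOT FOUND'}")
    return "\n".join(lines)


print(report(["xlrd", "pandas", "pyspark"]))
```

If the interpreter path printed here is not the one pip installed into (compare it with the site-packages path in the bootstrap log), the packages went to a different Python than the one the notebook uses.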
Upvotes: 1