Emerdan

Reputation: 61

Import not working in Amazon EMR even after pip install in bootstrap

I am trying to run an Amazon EMR cluster (version emr-6.1.0) and want some Python packages to be preinstalled on it.

So, I used a bootstrap script:

#!/bin/bash
sudo pip3 install --user pyspark pandas xlrd==1.2.0

The EMR cluster starts up fine, but when I try to import any of the modules I installed, I get an import error.

Python 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.24.1 -- An enhanced Interactive Python. Type '?' for help.

>>> import xlrd

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-2743bb67f6dd> in <module>
----> 1 import xlrd

ModuleNotFoundError: No module named 'xlrd'

My first thought was that the packages were not getting installed, but the EMR log file stdout.gz (in the path Amazon S3 /aws-logs-600286585385-us-east-1/elasticmapreduce/j-27GOG786YFR2SB/node/i-02fabe3g74jf9959a/bootstrap-actions/1/) says otherwise:

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/89/db/e18cfd78e408de957821ec5ca56de1250645b05f8523d169803d8df35a64/pyspark-3.1.2.tar.gz (212.4MB)
Collecting pandas
  Downloading https://files.pythonhosted.org/packages/99/f7/01cea7f6c963100f045876eb4aa1817069c5c9eca73d2dbfb5d31ff9a39f/pandas-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (10.8MB)
Collecting xlrd==1.2.0
  Downloading https://files.pythonhosted.org/packages/b0/16/63576a1a001752e34bf8ea62e367997530dc553b689356b9879339cf45a4/xlrd-1.2.0-py2.py3-none-any.whl (103kB)
Collecting py4j==0.10.9 (from pyspark)
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Collecting python-dateutil>=2.7.3 (from pandas)
  Downloading https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227kB)
Collecting numpy>=1.17.3 (from pandas)
  Downloading https://files.pythonhosted.org/packages/2c/d2/8973eb282fc3c7e6c4db0469f0390d81d8eb9ae56dfaa2a7e6db07283682/numpy-1.21.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (14.1MB)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas)
Installing collected packages: py4j, pyspark, python-dateutil, numpy, pandas, xlrd
  Running setup.py install for pyspark: started
    Running setup.py install for pyspark: finished with status 'done'
Successfully installed numpy-1.21.0 pandas-1.3.0 py4j-0.10.9 pyspark-3.1.2 python-dateutil-2.8.1 xlrd-1.2.0

Any ideas on what is going on, or how to solve the issue?

Upvotes: 0

Views: 1284

Answers (2)

Emerdan

Reputation: 61

I solved my issue, and am posting what went wrong here.

After the EMR cluster is created, it first enters the "Starting" state. Even though you can connect a notebook to the cluster in this state, bootstrapping has not yet been done. After some time, the cluster automatically enters the "Bootstrapping" state, in which the bootstrap commands are executed.

I made the mistake of trying to import the packages before the "Bootstrapping" state was over, which caused the import error.

For further info regarding the lifecycle of an EMR cluster, check this doc.
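To avoid importing too early, one option is to poll the cluster state and only proceed once it has left "Starting" and "Bootstrapping". The following is a minimal sketch, not something from the original setup: the helper names `wait_until_bootstrapped` and `emr_state_getter` are made up for illustration, and the boto3-backed part assumes boto3 is installed and AWS credentials are configured.

```python
import time

# Cluster states in which bootstrap actions have already finished.
READY_STATES = {"RUNNING", "WAITING"}
# Cluster states from which the cluster will never become ready.
FAILED_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_bootstrapped(get_state, poll_seconds=30, max_polls=60):
    """Poll get_state() until the cluster has left STARTING/BOOTSTRAPPING.

    get_state is any zero-argument callable returning the state string
    (hypothetical helper; the name is an assumption for this sketch).
    """
    for _ in range(max_polls):
        state = get_state()
        if state in READY_STATES:
            return state
        if state in FAILED_STATES:
            raise RuntimeError(f"cluster ended in state {state}")
        time.sleep(poll_seconds)
    raise TimeoutError("cluster did not finish bootstrapping in time")


def emr_state_getter(cluster_id, region_name):
    """Build a get_state callable backed by boto3's describe_cluster."""
    import boto3  # imported lazily so the polling helper works without boto3

    emr = boto3.client("emr", region_name=region_name)
    return lambda: emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
```

With a placeholder cluster ID, usage would look like `wait_until_bootstrapped(emr_state_getter("j-XXXXXXXXXXXXX", "us-east-1"))`. boto3 also ships a built-in `cluster_running` waiter that covers the same need if you do not want a custom loop.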

Upvotes: 2

SnigJi

Reputation: 1410

If pip3 doesn't work, try it this way:

sudo python3 -m pip install pandas xlrd==1.2.0

I faced a similar issue when I was working with emr-5.26.0, and this worked for me. I am not sure what the difference between pip3 install and python3 -m pip install is, though.
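One general difference is that python3 -m pip runs pip under whichever interpreter python3 resolves to, while the pip3 script can belong to a different Python installation. Whether that matters on a given cluster is an assumption to verify, not a confirmed cause here. A small diagnostic sketch, run from the notebook or shell where the import fails, shows which interpreter is active and where it would load each module from (the helper name `describe_module` is made up for illustration):

```python
import importlib.util
import sys


def describe_module(name):
    """Return the file `name` would be imported from here, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# The interpreter actually running this code (compare with `which python3`).
print("interpreter:", sys.executable)
for module in ("pandas", "xlrd"):
    print(module, "->", describe_module(module))
```

If a module prints as None even though the bootstrap log reports a successful install, the packages most likely went into a different interpreter's site-packages, or the bootstrap action has not run yet.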

Upvotes: 1
