Reputation: 960
TLDR - I want to run the command sudo yes | sudo pip3 uninstall numpy
twice in EMR bootstrap actions but it runs only once.
I will first say that my goal is to run a Pyspark-enabled EMR managed notebook, running on an EMR cluster. For various reasons I need pandas to be installed on the cluster as well. First, I encountered a problem where two numpy versions exist in the default python3 installation, and they both have to be removed to use the newer version (as in this thread - How do I have multiple versions of numpy installed on Amazon EMR and how to I delete the early versions?).
If I ssh into the master node and perform sudo yes | sudo pip3 uninstall numpy
twice, it works:
[hadoop@ip-xxx-xx-xx-xxx ~]$ sudo yes | sudo pip3 uninstall numpy
Uninstalling numpy-1.21.1:
/usr/bin/f2py
/usr/local/bin/f2py
/usr/local/bin/f2py3
/usr/local/bin/f2py3.7
.......
.......
.......
/usr/local/lib64/python3.7/site-packages/numpy/typing/tests/test_runtime.py
/usr/local/lib64/python3.7/site-packages/numpy/typing/tests/test_typing.py
/usr/local/lib64/python3.7/site-packages/numpy/version.py
Proceed (y/n)? Successfully uninstalled numpy-1.21.1
[hadoop@ip-xxx-xx-xx-xxx ~]$ sudo yes | sudo pip3 uninstall numpy
Uninstalling numpy-1.16.5:
/usr/local/lib64/python3.7/site-packages/numpy
/usr/local/lib64/python3.7/site-packages/numpy-1.16.5-py3.7.egg-info
Proceed (y/n)? Successfully uninstalled numpy-1.16.5
I get numpy removed from the python3 installation, and then I can install numpy and pandas normally.
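Incidentally, instead of hard-coding the uninstall twice, a small loop can keep uninstalling until pip no longer finds the package. This is a sketch I have not run on EMR; it relies only on pip's standard show and uninstall -y subcommands, and the PIP variable is just a hook so the loop can be dry-run without touching a real installation (on EMR it would stay at its default of sudo pip3):

```shell
# Keep uninstalling numpy until `pip show` no longer finds a copy.
# PIP is overridable for dry-runs; on EMR it defaults to "sudo pip3".
PIP="${PIP:-sudo pip3}"

uninstall_all_numpy() {
  while $PIP show numpy >/dev/null 2>&1; do
    $PIP uninstall -y numpy   # -y answers the Proceed (y/n)? prompt
  done
}
```

With two stacked installations, as above, the loop would run the uninstall twice and stop once pip3 show numpy comes back empty.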
The problem happens when I want to perform the same thing using bootstrap actions. Using this bootstrap.sh file:
#!/bin/bash
sudo yes | sudo yum install python3-devel
sudo pip3 install cython
sudo pip3 install matplotlib
sudo yes | sudo pip3 uninstall numpy
sudo pip3 install pyspark boto3
sudo yes | sudo pip3 uninstall numpy
sudo pip3 install numpy
sudo pip3 install pandas
Notice that I'm uninstalling numpy twice here, but the second sudo yes | sudo pip3 uninstall numpy
command is simply ignored! Because the second numpy installation never gets removed, I end up with a broken pandas installation (again, see the thread I linked to previously). Why does this happen? Since the bootstrap actions don't work, and it's impossible to ssh into the slave nodes, I'm left with a broken pandas installation and no way to fix it.
Upvotes: 5
Views: 7521
Reputation: 960
OK, it ain't pretty, but this is what I did to solve it:
First of all, I used EMR version 5.33.0 and installed the following applications on the cluster: Spark and JupyterEnterpriseGateway (the latter because I am using EMR notebooks).
I used the following bootstrap actions to install necessary things for python:
#!/bin/bash
sudo yes | sudo yum install python3-devel
sudo pip3 install cython
sudo pip3 install setuptools --upgrade
After the cluster has spun up, I run this external Python script, which has to be given the master node's public DNS name, the master and slave nodes' private IP addresses, and the private key filepath. What it does is uninstall numpy, install it again, and then install pandas. I have to say this solution feels very hacky; I would much prefer to do it via bootstrap actions, and I still don't really understand why I can't.
import argparse
import os

import paramiko
from scp import SCPClient

INSTALL_COMMANDS = [
    'sudo yes | sudo pip3 uninstall numpy',
    'sudo pip3 install numpy',
    'sudo pip3 install pandas',
]

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--master_address')
    parser.add_argument('--master_ip')
    parser.add_argument('--slave_ips', nargs='+')
    parser.add_argument('--username', default='hadoop')
    parser.add_argument('--pk_filepath')
    args = parser.parse_args()

    key = paramiko.RSAKey.from_private_key_file(args.pk_filepath)
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(hostname=args.master_address, username=args.username, pkey=key)

    # Copy the private key to the master node and restrict its permissions,
    # so the master can be used as a jump host to reach the slaves.
    scp = SCPClient(client.get_transport())
    scp.put(args.pk_filepath, remote_path='~')
    command_chmod = f'chmod 600 {os.path.split(args.pk_filepath)[-1]}'
    print(f'executing: {command_chmod}')
    stdin, stdout, stderr = client.exec_command(command_chmod)

    # Tunnel through the master to open an SSH session on each slave.
    all_clients = [client]
    for slave_ip in args.slave_ips:
        slave_channel = client.get_transport().open_channel(
            'direct-tcpip', (slave_ip, 22), (args.master_ip, 22))
        slave_client = paramiko.SSHClient()
        slave_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        slave_client.connect(hostname=slave_ip,
                             username=args.username,
                             pkey=key,
                             sock=slave_channel)
        all_clients.append(slave_client)

    # Run the fix-up commands on the master and every slave.
    for node_client in all_clients:
        for command in INSTALL_COMMANDS:
            print(f'executing {command} on {node_client}')
            stdin, stdout, stderr = node_client.exec_command(command)
            print(stderr.read())
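For reference, a hypothetical invocation of the script above; the filename, addresses, and key path are all placeholders to be replaced with your own cluster's values:

```shell
python3 fix_numpy.py \
    --master_address ec2-xx-xx-xx-xx.compute-1.amazonaws.com \
    --master_ip 172.31.0.10 \
    --slave_ips 172.31.0.11 172.31.0.12 \
    --pk_filepath ~/.ssh/my-emr-key.pem
```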
Upvotes: 1
Reputation: 3495
Unfortunately this is not fixable at the moment for EMR clusters. I know it's not a popular answer, but "there is no solution" is the answer you are looking for.
You can see the issue reported on official aws forum here for more details.
I will summarize the issue below.
If you want to have pandas installed, you need numpy. The issue at the moment is that (based on the current conclusion), regardless of what happens in bootstrap.sh, the system python37-numpy package gets installed automatically after bootstrap.sh runs.
This means that if you install numpy again through bootstrap, importing numpy within Python will import the older version, which is enforced by steps that happen after bootstrap during the EMR cluster setup.
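If you want to confirm which copy wins on a given node (or in a notebook), a quick check, my own sketch rather than anything from the forum thread, is to ask Python where a bare import numpy would resolve:

```python
import importlib.util

# Locate the module that `import numpy` would load; the reported path
# shows whether it is the system copy under /usr/local/lib64 or a
# freshly pip-installed one.
spec = importlib.util.find_spec("numpy")
if spec is None:
    print("numpy is not importable")
else:
    print(spec.origin)
```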
You have two solutions: wait for aws/emr to upgrade the bundled numpy, or install an older pandas version that still works with the preinstalled numpy.
I personally used the second solution and installed pandas 1.1.5 until the numpy version is upgraded by aws/emr.
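If you go that route, the pin is a one-liner in the bootstrap script. This is only a sketch: 1.1.5 is the version from this answer, and the right pin may change as EMR images are updated:

```shell
#!/bin/bash
# Pin pandas to a release that still works with the preinstalled numpy 1.16.x
# instead of trying to uninstall the system numpy.
sudo pip3 install "pandas==1.1.5"
```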
I would just add, regarding your statement that bootstrap actions "don't work, and it's impossible to ssh into the slave nodes": you can ssh into the master and into the executors whenever you want. The default security group for EMR does not expose port 22, so if you are not able to ssh to the executors, I would take a look there first.
Upvotes: 8