Reputation:
I have deployed HDP 2.6.4 on a virtual machine.
I can see that Spark2 is not pointing to the correct Python folder. My questions are:
1) How can I find where my Python is located?
Solution: type whereis python and you will get a list of the locations where it is installed.
2) How can I update the existing Python libraries and add new libraries to that folder? For example, the equivalent of pip install numpy on the CLI.
3) How can I make Zeppelin's Spark2 interpreter point at the specific directory containing the Python that I can update? On Zeppelin, there is a little 'edit' button with which I can change the path to the directory that contains Python.
Solution: go to the interpreter settings in Zeppelin, find spark2, and make zeppelin.pyspark.python point to where Python already lives.
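For illustration, this is roughly what the property looks like after the edit (the path is whatever your own installation reports; /usr/bin/python is just the common default):

    zeppelin.pyspark.python = /usr/bin/python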
Note that if you need Python 3.4+, there is a whole set of different steps you have to follow to first get Python 3.4+ into the HDP sandbox.
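Those steps are beyond this question, but here is a rough sketch, assuming the sandbox is CentOS 7 with internet access (the EPEL package name and the get-pip URL are assumptions that may need adjusting for your release):

    # Install Python 3.4 from the EPEL repository
    sudo yum install -y epel-release
    sudo yum install -y python34
    # pip is not bundled with that package; bootstrap it with the
    # get-pip script pinned for Python 3.4
    curl -O https://bootstrap.pypa.io/pip/3.4/get-pip.py
    sudo python3.4 get-pip.py
    # Verify the new interpreter
    python3.4 --version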
Thank you,
Upvotes: 0
Views: 2042
Reputation: 1631
For a Sandbox environment like yours, the sandbox image is built on a Linux OS (CentOS). The Zeppelin notebook, in all probability, points to the Python installation that ships with every Linux OS. If you wish to have your own installation of Python and your own set of libraries for data analysis, like those in the SciPy stack, you need to install Anaconda on your virtual machine. Your VM needs to be connected to the internet so that you can download and install the Anaconda package.
You can then point Zeppelin to Anaconda's interpreter at the following path: /home/user/anaconda3/bin/python, where user is your username.
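A minimal sketch of that installation, assuming a 64-bit Linux VM with internet access (the installer version below is only an example; pick a current one from the Anaconda archive):

    # Download an Anaconda3 installer (version shown is illustrative)
    wget https://repo.anaconda.com/archive/Anaconda3-5.0.1-Linux-x86_64.sh
    # Run the installer; accepting the defaults installs to ~/anaconda3
    bash Anaconda3-5.0.1-Linux-x86_64.sh
    # Confirm the interpreter that Zeppelin should point to
    ~/anaconda3/bin/python --version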
The Zeppelin configuration also confirms that it uses the default Python installation at /usr/bin/python. You can go through its documentation for more information.
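To confirm what that default path actually resolves to on your sandbox, you can check from the terminal:

    # Show where the /usr/bin/python symlink ultimately points
    readlink -f /usr/bin/python
    # And its version
    /usr/bin/python --version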
UPDATE
Hi Joseph, Spark installations, by default, use the Python interpreter and the Python libraries that are installed on your OS. The folder structure you have shown only tells you the location of the PySpark module. This module is a library just like Pandas or NumPy.
What you can do is install the SciPy stack [NumPy, Pandas, Matplotlib, etc.] via the command pip install <package-name> and import those libraries directly into your Zeppelin notebook.
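For example, assuming pip is installed for the system Python that Zeppelin uses:

    # Install the usual SciPy-stack packages for the system Python
    sudo pip install numpy pandas matplotlib scipy
    # Upgrade a library that is already installed
    sudo pip install --upgrade numpy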
Use the command whereis python in the terminal of your sandbox; the result will give you something like the following:
/usr/bin/python /usr/bin/python2.7 ....
In your Zeppelin configuration, for the property zeppelin.pyspark.python, you can set the first value from the output of the previous command, i.e. /usr/bin/python. Now all the libraries you installed via the pip install command will be available to you in Zeppelin.
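As a quick check that the interpreter you configured really sees those libraries, you can ask it directly from the sandbox terminal (numpy here is just an example package):

    # Run the exact binary set in zeppelin.pyspark.python
    /usr/bin/python -c 'import numpy; print(numpy.__version__)'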
This process would only work for your Sandbox environment. In a real production cluster, your administrator needs to install all these libraries on all the nodes of your Spark cluster.
Upvotes: 1