Reputation: 1320
I want to use the spark-csv package from https://github.com/databricks/spark-csv from within the Jupyter service running on a Spark HDInsight cluster on Azure.
On a local cluster I know I can do this like so:
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
However, I don't know where to put this in the Azure Spark configuration. Any clues or hints are appreciated.
Upvotes: 3
Views: 993
Reputation: 2571
Since you are using HDInsight, you can use a "Script Action" that runs when the Spark cluster is provisioned to install the needed libraries. The script can be a very simple shell script; it is executed automatically on startup and re-executed automatically on new nodes if the cluster is resized.
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster-linux/
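A minimal sketch of such a script, assuming Spark reads its defaults from the standard HDP location — the path and the use of spark.jars.packages are assumptions to adapt, not a verified HDInsight recipe:
#!/usr/bin/env bash
# Hypothetical script action: make spark-csv available to all Spark jobs.
# Assumption: spark-defaults.conf lives at this HDP path on each node.
CONF=/usr/hdp/current/spark-client/conf/spark-defaults.conf
echo "spark.jars.packages com.databricks:spark-csv_2.11:1.3.0" >> "$CONF"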
Upvotes: 0
Reputation: 486
You can use the %%configure magic to add any required external package. It should be as simple as putting the following snippet in your first code cell.
%%configure
{ "packages":["com.databricks:spark-csv_2.10:1.4.0"] }
This specific example is also covered in the documentation. Just make sure you start the Spark session after the %%configure cell.
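Once the session starts with the package loaded, reading a CSV uses the spark-csv data source. A minimal sketch — the wasb:/// path is only an illustrative placeholder:
# Hypothetical usage after the %%configure cell; the path is a placeholder.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("wasb:///example/data/sample.csv"))
df.show()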
Upvotes: 2
Reputation: 91
One option for managing Spark packages in a cluster from a Jupyter notebook is Apache Toree. Toree gives you extra line magics that let you manage Spark packages from within a notebook. For example, inside a Jupyter Scala notebook, you would install spark-csv with
%AddDeps com.databricks spark-csv_2.11 1.4.0 --transitive
To install Apache Toree on your Spark cluster, SSH into the cluster and run:
sudo pip install --pre toree
sudo jupyter toree install \
--spark_home=$SPARK_HOME \
--interpreters=PySpark,SQL,Scala,SparkR
I know you specifically asked about Jupyter notebooks running PySpark. At the time of writing, Apache Toree is an incubating project, and I have run into trouble using the provided line magics with PySpark notebooks specifically; maybe you will have better luck. I am still looking into why, but personally I prefer Scala in Spark. Hope this helps!
Upvotes: 1
Reputation: 146
You can try executing your two lines of code (export ...) in a script that you invoke in Azure when the HDInsight cluster is created.
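A minimal sketch of such a script; whether these variables actually reach the Jupyter service depends on how the cluster launches PySpark, and the profile.d path and file name are assumptions:
#!/usr/bin/env bash
# Hypothetical cluster-creation script: persist the same exports the
# question uses locally so that login shells pick them up.
cat <<'EOF' >> /etc/profile.d/pyspark-packages.sh
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
EOF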
Upvotes: 0