mitchkman

Reputation: 6680

How to install custom Spark version in Cloudera

I am new to Spark, Hadoop, and Cloudera. We need to use a specific version of Spark (1.5.2), and we are also required to use Cloudera for cluster management, including for Spark.

However, CDH 5.5 ships with Spark 1.5.0, which cannot be changed easily.

People mention "just downloading" a custom version of Spark manually. But how can such a "custom" Spark version be managed by Cloudera, so I can distribute it across the cluster? Or does it need to be operated and provisioned completely separately from Cloudera?

Thanks for any help and explanation.

Upvotes: 0

Views: 2024

Answers (2)

xmar

Reputation: 1809

Under YARN, you can run any application with any version of Spark. After all, Spark is just a bunch of libraries, so you can package your jar with your dependencies and send it to YARN. However, there are a few additional small tasks to be done.
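
For instance, here is a rough, hypothetical sketch of what "package your jar with your dependencies and send it to YARN" can look like from a manually unpacked Spark directory (the install path, class name, and jar names are placeholders, not from the original answer):

# Hypothetical paths and names; adjust to your own application and Spark location
$ cd /opt/spark-1.5.2-bin-hadoop2.6
$ ./bin/spark-submit \
    --master yarn-cluster \
    --class com.example.MyApp \
    --jars /path/to/dep1.jar,/path/to/dep2.jar \
    /path/to/my-app.jar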

In the following link, dlb8 provides a list of tasks to be done to run Spark 2.0 on an installation with a previous version. Just change the version/paths accordingly.

Find the version of CDH and Hadoop running on your cluster using

$ hadoop version
Hadoop 2.6.0-cdh5.4.8

Download Spark and extract the sources. Pre-built Spark binaries should work out of the box with most CDH versions, unless there are custom fixes in your CDH build, in which case you can use spark-2.0.0-bin-without-hadoop.tgz. (Optional) You can also build Spark yourself by opening the distribution directory in a shell and running the following command, using the CDH and Hadoop version from step 1:

$ ./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn

Note: With Spark 2.0 the default build uses Scala 2.11. If you need to stick to Scala 2.10, use the -Dscala-2.10 property or run $ ./dev/change-scala-version.sh 2.10. Note that -Phadoop-provided enables the profile that builds the assembly without including the Hadoop-ecosystem dependencies provided by Cloudera.
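
If you go with a pre-built binary instead of building it yourself, the download is a single fetch; the mirror URL below is an example and may differ depending on the Spark version and mirror you pick:

$ wget https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.6.tgz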

Extract the tgz file.

$ tar -xvzf /path/to/spark-2.0.0-bin-hadoop2.6.tgz

cd into the custom Spark distribution and configure it by copying the configuration from your current Spark version:

$ cp -R /etc/spark/conf/* conf/
$ cp /etc/hive/conf/hive-site.xml conf/

Change SPARK_HOME to point to the folder with the Spark 2.0 distribution:

$ sed -i "s#\(.*SPARK_HOME\)=.*#\1=$(pwd)#" conf/spark-env.sh

Change spark.master from yarn-client to yarn in spark-defaults.conf

$ sed -i 's/spark.master=yarn-client/spark.master=yarn/' conf/spark-defaults.conf

Delete spark.yarn.jar from spark-defaults.conf

$ sed -i '/spark.yarn.jar/d' conf/spark-defaults.conf

Finally test your new Spark installation:

$ ./bin/run-example SparkPi 10 --master yarn
$ ./bin/spark-shell --master yarn
$ ./bin/pyspark
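
If you want to confirm that the shell really picks up the custom build, you can print the version from inside it; piping the expression in is just a quick non-interactive trick, not part of the original answer:

$ echo 'sc.version' | ./bin/spark-shell --master yarn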

Update log4j.properties to suppress annoying warnings by adding the following line to conf/log4j.properties:

echo "log4j.logger.org.spark_project.jetty=ERROR" >> conf/log4j.properties

However, the procedure can also be adapted in the opposite direction, since the bottom line is "use a Spark version on an installation that ships a different version". It is even simpler if you don't have to deal with the 1.x to 2.x transition, because you don't need to pay attention to the change of Scala version and of the assembly approach.

I tested this on a CDH 5.4 installation to set up Spark 1.6.3 and it worked fine. I did it with the "spark.yarn.jars" option:

####  set "spark.yarn.jars"
$ cd $SPARK_HOME
$ hadoop fs -mkdir spark-2.0.0-bin-hadoop
$ hadoop fs -copyFromLocal jars/* spark-2.0.0-bin-hadoop
$ echo "spark.yarn.jars=hdfs:///nameservice1/user/<yourusername>/spark-2.0.0-bin-hadoop/*" >> conf/spark-defaults.conf

Upvotes: 0

JustCoder

Reputation: 337

Yes, it is possible to run any Apache Spark version!

Things to make sure of before doing it:

  • You have YARN configured in Cloudera Manager. After that, you can run your application as a YARN application with spark-submit; please refer to this link. It will run like any other YARN application (see the sketch after this list).
  • It is not mandatory to install Spark on the cluster; you can just run your application.
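
As a rough sketch of what running it as a plain YARN application can look like (the Spark unpack path, class, and jar names are placeholders, not from the original answer); afterwards it shows up in the YARN application list like any other job:

$ /opt/spark-1.5.2-bin-hadoop2.6/bin/spark-submit \
    --master yarn-cluster \
    --class com.example.MyApp \
    /path/to/my-app.jar
$ yarn application -list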

Upvotes: 1
