Praveen Sripati

Reputation: 33495

Scala dependency on Spark installation

I am just getting started with Spark, so I downloaded the binaries for Hadoop 1 (HDP1, CDH3) from here and extracted them on an Ubuntu VM. Without installing Scala, I was able to execute the examples in the Quick Start guide from the Spark interactive shell.

  1. Does Spark come included with Scala? If yes, where are the libraries/binaries?
  2. For running Spark in other modes (distributed), do I need to install Scala on all the nodes?

As a side note, I observed that Spark has some of the best documentation among open source projects.

Upvotes: 5

Views: 6251

Answers (4)

RisJi

Reputation: 182

From Spark 1.1 onwards, there is no SparkBuild.scala. You have to make your changes in pom.xml and build using Maven.

Upvotes: 0

tuxdna

Reputation: 8487

Does Spark come included with Scala? If yes, where are the libraries/binaries?

The project configuration is placed in the project/ folder. In my case it is:

$ ls project/
build.properties  plugins.sbt  project  SparkBuild.scala  target

When you run sbt/sbt assembly, it downloads the appropriate version of Scala along with the other project dependencies. Check out the target/ folder, for example:

$ ls target/
scala-2.9.2  streams

Note that the Scala version is 2.9.2 for me.
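A quick way to confirm which Scala version the bundled shell is actually running (rather than inferring it from the folder name) is to ask the REPL itself:

scala> scala.util.Properties.versionString   // prints e.g. "version 2.9.2"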

For running Spark in other modes (distributed), do I need to install Scala on all the nodes?

Yes. You can create a single assembly jar as described in the Spark documentation:

If your code depends on other projects, you will need to ensure they are also present on the slave nodes. A popular approach is to create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark itself as a provided dependency; it need not be bundled since it is already present on the slaves. Once you have an assembled jar, add it to the SparkContext as shown here. It is also possible to submit your dependent jars one-by-one when creating a SparkContext.
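As a rough sketch of that last point (the app name, master URL, and jar paths below are placeholders, and the exact API differs a little between Spark versions), shipping your assembly jar to the cluster can look like this:

import org.apache.spark.{SparkConf, SparkContext}

// Build a configuration that ships our own assembly jar (our code plus its
// dependencies, with Spark itself marked "provided") to the worker nodes.
val conf = new SparkConf()
  .setAppName("MyApp")                            // placeholder app name
  .setMaster("spark://master:7077")               // placeholder cluster master
  .setJars(Seq("/path/to/my-app-assembly.jar"))   // placeholder path to the assembly jar

val sc = new SparkContext(conf)

// Dependent jars can also be added one by one after the context is created:
sc.addJar("/path/to/extra-dependency.jar")        // placeholder path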

Upvotes: 4

Vidya

Reputation: 30310

You do need Scala to be available on all nodes. However, with the binary distribution via make-distribution.sh, there is no longer a need to install Scala on all nodes. Keep in mind the distinction between installing Scala, which is necessary to run the REPL, and merely packaging Scala as just another jar file.

Also, as mentioned in make-distribution.sh itself:

# The distribution contains fat (assembly) jars that include the Scala library,
# so it is completely self contained.
# It does not contain source or *.class files.

So Scala does indeed come along for the ride when you use make-distribution.sh.

Upvotes: 1

vijay kumar

Reputation: 2049

Praveen -

I just checked the fat/assembly jar:

/SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar

This jar includes all the Scala binaries plus the Spark binaries.

You are able to run the examples because this file is added to your CLASSPATH when you run spark-shell.

Check it here: run spark-shell > http://machine:4040 > Environment > Classpath Entries.

If you downloaded a pre-built Spark, then you don't need to have Scala on the nodes; having this file on the CLASSPATH of the nodes is enough.
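If you prefer to check from the shell itself instead of the web UI, here is a quick sketch (the exact jar name will differ with your Spark/Hadoop version):

// paste inside spark-shell: print the classpath entries containing the assembly jar
System.getProperty("java.class.path")
  .split(java.io.File.pathSeparator)
  .filter(_.contains("spark-assembly"))
  .foreach(println)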

Note: I deleted the last answer I posted because it might mislead someone. Sorry :)

Upvotes: 3
