Reputation: 593
I am new to Spark. I can launch, manage, and shut down Spark clusters on Amazon EC2 by following http://spark.incubator.apache.org/docs/0.7.3/ec2-scripts.html.
But I am not able to run the job below on the cluster.
package spark.examples

import spark.SparkContext
import SparkContext._

object SimpleJob {
  def main(args: Array[String]) {
    val logFile = "< Amazon S3 file url>"
    val sc = new SparkContext(
      "spark://<Host Name>:7077",
      "Simple Job",
      System.getenv("SPARK_HOME"), Seq("<Jar Address>")
    )
    val logData = sc.textFile(logFile)
    val numsa = logData.filter(line => line.contains("a")).count
    val numsb = logData.filter(line => line.contains("b")).count
    println("total a : %s, total b : %s".format(numsa, numsb))
  }
}
I created SimpleJob.scala and added it to the spark.examples package in my local Spark directory. I then ran the command:
./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>
The cluster starts and I am able to log in to it. But I don't know how to add and run this job on the EC2 cluster.
Upvotes: 1
Views: 1513
Reputation: 186
If you are able to run the job locally, then the most probable issue is that the Spark workers cannot access your jar. Let me know if the following steps work:
Export your code into a jar file (I usually use Eclipse, but you can use sbt too).
Run the following command on the master:
SPARK_CLASSPATH=<path/to/jar/file> ./run <Class> [arguments]
For example:
SPARK_CLASSPATH=Simple.jar ./run spark.examples.SimpleJob
Also check in the Spark master UI that your workers are alive. Hope this helps!
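Before running the command above, it can help to confirm that the exported jar really contains the compiled job class. The following is only a minimal sketch: the helper names class_to_entry and jar_has_class are made up for illustration, and it assumes the unzip tool is installed on the machine where you run it.

```shell
# Hypothetical helpers (not part of Spark) for checking a jar before running it.

# Convert a fully qualified class name to its entry path inside a jar,
# e.g. spark.examples.SimpleJob -> spark/examples/SimpleJob.class
class_to_entry() {
  echo "$(printf '%s' "$1" | tr '.' '/').class"
}

# Return 0 if the jar ($1) lists the class ($2), non-zero otherwise.
# Assumes unzip is available; a missing or unreadable jar also yields non-zero.
jar_has_class() {
  entry=$(class_to_entry "$2")
  unzip -l "$1" 2>/dev/null | grep -qF "$entry"
}
```

With these defined, something like jar_has_class Simple.jar spark.examples.SimpleJob && SPARK_CLASSPATH=Simple.jar ./run spark.examples.SimpleJob would only start the job when the class is present in the jar.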
Upvotes: 1
Reputation: 40973
I suggest you first try to run it locally; once you achieve that, you will have a better idea of the process involved. Follow the instructions here in the section "A standalone job in Scala". Then copy the script to the remote machine and run it from there with:
./run spark.examples.SimpleJob
If you try to connect to your remote Spark master from the local script with:
MASTER=spark://ec2-174-129-181-44.compute-1.amazonaws.com:7077 ./run spark.examples.SimpleJob
the most probable result is a connection error, as port 7077 is blocked by default in EC2.
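One way to tell a blocked port from a problem in the job itself is to test the TCP connection to the master before launching anything. This is a rough sketch only: parse_master_url and master_reachable are hypothetical helper names, and it assumes a netcat (nc) build that supports the -z and -w options.

```shell
# Hypothetical helpers (not part of Spark) for probing the master port.

# Split a master URL of the form spark://host:port into "host port".
parse_master_url() {
  rest=${1#spark://}   # drop the spark:// scheme
  host=${rest%%:*}     # everything before the first colon
  port=${rest##*:}     # everything after the last colon
  printf '%s %s\n' "$host" "$port"
}

# Return 0 if a TCP connection to the master succeeds within 5 seconds.
# Assumes nc supports -z (scan only, send no data) and -w (timeout in seconds).
master_reachable() {
  set -- $(parse_master_url "$1")
  nc -z -w 5 "$1" "$2"
}
```

For example, master_reachable spark://ec2-174-129-181-44.compute-1.amazonaws.com:7077 || echo "port 7077 not reachable" would report a blocked port from the local machine without starting the job.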
Upvotes: 1