Reputation: 2213
I am new to running Spark on Kubernetes, and I have a very simple app that I am trying to package to run on the Spark on K8s cluster I have set up. The challenge I am facing is how to package my app to run in Spark. I installed Spark on K8s from the Spark Operator, and I noticed that the examples use the image gcr.io/spark-operator/spark:v3.0.0.
I also noticed that the Spark documentation mentions the docker-image-tool.sh script for building an image, but that looks like it is for custom-tailoring the environment. I would just like a lightweight image my app can use to run on the Spark cluster, and I am not sure how to tie it all together.
So from the documentation I see two options around the docker-image-tool.sh script, but I'm not sure which one I need or when to use either. Why use one over the other? Is there another option? Is there a prebuilt image that I can just copy my application into and use to run?
Upvotes: 1
Views: 2289
Reputation:
I think I confused you a little bit. Building a Spark distribution's image and packaging your app are two separate things. Here's how you'd deploy your app using Kubernetes as a scheduler.
Step 1: Build Spark's image
./bin/docker-image-tool.sh -r asia.gcr.io/menace -t v3.0.0 build
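docker-image-tool.sh ships with the Spark distribution itself, so the command above is run from the root of an unpacked distribution. A sketch, assuming the Spark 3.0.0 / Hadoop 2.7 build - swap in whichever release you actually need:
# Download and unpack a Spark distribution (version/build here are assumptions)
curl -LO https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
tar -xzf spark-3.0.0-bin-hadoop2.7.tgz
cd spark-3.0.0-bin-hadoop2.7
# docker-image-tool.sh is now available under ./bin/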
Step 2: Push the image to your Docker registry
./bin/docker-image-tool.sh -r asia.gcr.io/menace -t v3.0.0 push
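Depending on your registry you may need to authenticate Docker first. For the GCR registry used in this example, and assuming the gcloud CLI is installed, something like:
# Wire Docker up to your gcloud credentials for gcr.io registries
gcloud auth configure-docker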
Step 3: Configure your Kubernetes cluster so it can pull your image. In most cases this just requires setting up imagePullSecrets; see the Kubernetes docs on pulling images from a private registry. A sketch is shown below.
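A minimal sketch of that setup, assuming the pods run under the default service account; the secret name and credentials below are placeholders:
# Create a docker-registry secret with your registry credentials (values are placeholders)
kubectl create secret docker-registry regcred \
  --docker-server=asia.gcr.io \
  --docker-username=_json_key \
  --docker-password="$(cat key.json)" \
  --docker-email=you@example.com

# Attach it to the default service account as an imagePullSecret
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'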
Step 4: Write your spark app
package edu.girdharshubham

import org.apache.spark.sql.SparkSession

object Solution {
  def main(args: Array[String]): Unit = {
    // Build a session against the Kubernetes API server (proxied on localhost:8001)
    val spark = SparkSession
      .builder
      .master("k8s://http://localhost:8001")
      .config("spark.submit.deployMode", "client")
      .config("spark.executor.instances", "2")
      .config("spark.kubernetes.container.image", "asia.gcr.io/menace/spark:v3.0.0")
      .appName("sparkle")
      .getOrCreate()

    import spark.implicits._

    val someDF = Seq(
      (8, "bat"),
      (64, "mouse"),
      (-27, "horse")
    ).toDF("number", "word")

    println("========================================================")
    someDF.take(1).foreach(println)
    println("========================================================")

    spark.stop()
  }
}
Step 5: Run your application
sbt run
This would result in executor pods being spawned on your cluster.
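You can watch them come up with kubectl (the pod names are derived from the app name, here "sparkle", so yours will differ):
# Watch the executor pods appear and terminate
kubectl get pods -w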
Step 6: Package your application
sbt package
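sbt package writes the jar under target/scala-&lt;scala-version&gt;/; the exact file name depends on your build settings, so the path below is only an illustration:
# The jar you later hand to spark-submit lands here (name/version are assumptions)
ls target/scala-2.12/
# e.g. sparkle_2.12-0.1.0-SNAPSHOT.jar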
Step 7: Use the spark-submit command to run your app - refer to my initial answer
Now, coming to your question about packaging Spark's distribution: be careful about the version you package and the dependencies you use. Spark is a little iffy about versions.
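In particular, the Spark version in your build should match the version baked into the image. A minimal build.sbt sketch, assuming Scala 2.12 and Spark 3.0.0 - adjust to whatever you actually run:
// build.sbt - keep sparkVersion in sync with the Spark image you built above
name := "sparkle"
version := "0.1.0-SNAPSHOT"
scalaVersion := "2.12.10"

val sparkVersion = "3.0.0"

libraryDependencies ++= Seq(
  // mark these % "provided" if you only ever launch via spark-submit
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion
)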
Upvotes: 0
Reputation:
Spark's docker-image-tool.sh is a script for building Spark's image. If you want a lightweight Docker image, you can tweak the Dockerfile that ships with the project or write your own - just don't forget to adjust the entrypoint.sh script as well.
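For the "prebuilt image I can just copy my application into" case, a common pattern is to start from the Spark image built with docker-image-tool.sh (or any published Spark image) and add your jar on top. A rough sketch - the base image tag and jar path are assumptions:
# Dockerfile - extend the Spark image and bake the application jar into it (tag/paths are placeholders)
FROM asia.gcr.io/menace/spark:v3.0.0
COPY target/scala-2.12/sparkle_2.12-0.1.0-SNAPSHOT.jar /opt/spark/work-dir/
At submit time you would then point spark.kubernetes.container.image at this image and reference the jar as local:///opt/spark/work-dir/sparkle_2.12-0.1.0-SNAPSHOT.jar.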
Generally, the steps for getting your Spark app onto Kubernetes go like this:
1. Build your Spark image with the docker-image-tool.sh script (on minikube, use the -m flag to move it to the minikube env).
2. Submit your app using the spark-submit command.
command.Note:
If you are not invested so much in Kubernetes and just want to quickly try this whole platform out, you can just proxy the kube-api-server
by running the following command:
kubectl proxy
It would start serving the api server on localhost:8001
, then you can submit your spark app by running a command like
# Notes:
#  - If you don't specify the protocol in the master URL, it defaults to https.
#  - For cluster deploy mode you'd have to set up a service account first (see below).
#  - spark.executor.instances controls the number of executor pods.
bin/spark-submit \
  --master k8s://http://localhost:8001 \
  --deploy-mode client \
  --name sparkle \
  --class edu.girdharshubham.Solution \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.kubernetes.driver.pod.name="sparkle" \
  --conf spark.kubernetes.hadoop.configMapName="HADOOP_CONF_DIR" \
  --conf spark.kubernetes.executor.deleteOnTermination=false \
  path/to/jar
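If you do want cluster deploy mode, the usual minimum is a service account the driver can use to spawn executor pods, roughly like this (the account name and namespace are just examples):
# Create a service account for the driver and let it manage executor pods
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role \
  --clusterrole=edit \
  --serviceaccount=default:spark \
  --namespace=default
# Then add: --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark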
Considerations for running in client mode:
- spark.kubernetes.executor.deleteOnTermination - this should be your go-to conf if you are just starting out. By default, executor pods are deleted on failure or normal termination; setting it to false keeps them around, which helps you figure out rather quickly what's going on with your executor pods - whether they are failing or not.
Upvotes: 3