mxcolin

Reputation: 91

How to get basic Spark program running on Kubernetes

I'm trying to get off the ground with Spark and Kubernetes, but I'm running into difficulties. I used the Helm chart here:

https://github.com/bitnami/charts/tree/main/bitnami/spark

I have 3 workers and they all report running successfully. I'm trying to run the following program remotely:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://<master-ip>:<master-port>").getOrCreate()
df = spark.read.json('people.json')

Here's the part that's not entirely clear: where should the file people.json actually live? I have it locally, where I'm running the Python code, and I also have it on a PVC that the master and all workers can see at /sparkdata/people.json.

When I run the third line with just 'people.json', it starts running but errors out with:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

If I run it as '/sparkdata/people.json', then I get:

pyspark.sql.utils.AnalysisException: Path does not exist: file:/sparkdata/people.json

Not sure where to go from here. To be clear, I want it to read files from the PVC, which is an NFS share holding the data files.

Upvotes: 1

Views: 699

Answers (1)

Koedlt

Reputation: 6001

Your people.json file needs to be accessible to your driver + executor pods. This can be achieved in multiple ways:

  • having some kind of network/cloud storage that each pod can access (for your NFS-backed PVC, see the sketch right after this list)
  • mounting volumes on your pods, and then uploading the data to those volumes using --files in your spark-submit.
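
Since you already have an NFS share exposed as a PVC at /sparkdata, the first option maps directly onto your setup: mount that same PVC on the driver and executor pods. A minimal sketch, assuming your claim is named spark-data (swap in your actual claim name; the sparkdata segment in the conf keys is just an arbitrary volume label):

  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName=spark-data \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path=/sparkdata \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.options.claimName=spark-data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.path=/sparkdata \

With that in place, spark.read.json('/sparkdata/people.json') points at the same file on the driver and on every executor, which is what the read needs.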

The second option might be the simpler one to set up. The Spark on Kubernetes documentation discusses this in more detail, but let's get straight to the point. If you add the following arguments to your spark-submit, you should be able to get your people.json onto your driver + executors (you just have to choose sensible values for the $VAR variables in there):

  --files people.json \
  --conf spark.kubernetes.file.upload.path=$SOURCE_DIR \
  --conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
  --conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
  --conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
  --conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
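
As a purely illustrative filling-in of those variables for an NFS-backed setup like yours (every value below is an example, not something Spark mandates):

  SOURCE_DIR=file:///sparkdata        # where --files get staged; must be reachable from the pods
  VOLUME_TYPE=persistentVolumeClaim   # the NFS share is exposed to Kubernetes as a PVC
  VOLUME_NAME=sparkdata               # an arbitrary label that ties the conf keys together
  MOUNT_PATH=/sparkdata               # where the volume shows up inside each pod

One caveat: the options key depends on the volume type. A hostPath volume takes options.path, but a persistentVolumeClaim takes options.claimName=<your-pvc-name> instead, as in the sketch further up.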

You can always verify that the data actually made it there by going inside the pods themselves, like so:

kubectl exec -it <driver/executor pod name> -- bash
# you should now be inside a bash session in the pod
cd <mount-path-you-chose>
ls -al

That last ls -al command should show you a people.json file in there (after having done your spark-submit of course).
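
Once the file is visible there, the read in your script just needs to point at the in-pod mount path. A sketch, keeping the master URL placeholder from your question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("spark://<master-ip>:<master-port>").getOrCreate()
# Use the mount path, which is identical on the driver and every executor.
df = spark.read.json("/sparkdata/people.json")
df.show()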

Hope this helps!

Upvotes: 1
