mxcolin

Reputation: 91

How to get basic Spark program running on Kubernetes

I'm trying to get off the ground with Spark and Kubernetes, but I'm running into difficulties. I used the Helm chart here:

https://github.com/bitnami/charts/tree/main/bitnami/spark

I have 3 workers and they all report running successfully. I'm trying to run the following program remotely:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://<master-ip>:<master-port>").getOrCreate()
df = spark.read.json('people.json')

Here's the part that's not entirely clear: where should the file people.json actually live? I have it locally, where I'm running the Python code, and I also have it on a PVC that the master and all workers can see at /sparkdata/people.json.

When I run the third line with just 'people.json', it starts running but errors out with:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

If I run it as '/sparkdata/people.json', then I get:

pyspark.sql.utils.AnalysisException: Path does not exist: file:/sparkdata/people.json

Not sure where to go from here. To be clear, I want it to read files from the PVC, which is an NFS share holding the data files.

Upvotes: 1

Views: 699

Answers (1)

Koedlt

Reputation: 6001

Your people.json file needs to be accessible to your driver + executor pods. This can be achieved in multiple ways:

  • having some kind of network/cloud storage that each pod can access (for your NFS-backed PVC, see the sketch right after this list)
  • mounting volumes on your pods, and then uploading the data to those volumes using --files in your spark-submit.
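
Since you already have an NFS share exposed as a PVC at /sparkdata, the first option maps directly onto your setup: mount that same PVC on the driver and executor pods. A minimal sketch, assuming your claim is named spark-data (swap in your actual claim name; the sparkdata segment in the conf keys is just an arbitrary volume label):

  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName=spark-data \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path=/sparkdata \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.options.claimName=spark-data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.path=/sparkdata \

With that in place, spark.read.json('/sparkdata/people.json') points at the same file on the driver and on every executor, which is what the read needs.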

The second option might be the simpler one to set up. The Spark on Kubernetes documentation discusses this in more detail, but let's get straight to the point. If you add the following arguments to your spark-submit, you should be able to get your people.json onto your driver + executors (you just have to choose sensible values for the $VAR variables in there):

  --files people.json \
  --conf spark.kubernetes.file.upload.path=$SOURCE_DIR \
  --conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
  --conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
  --conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
  --conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
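
As a purely illustrative filling-in of those variables for an NFS-backed setup like yours (every value below is an example, not something Spark mandates):

  SOURCE_DIR=file:///sparkdata        # where --files get staged; must be reachable from the pods
  VOLUME_TYPE=persistentVolumeClaim   # the NFS share is exposed to Kubernetes as a PVC
  VOLUME_NAME=sparkdata               # an arbitrary label that ties the conf keys together
  MOUNT_PATH=/sparkdata               # where the volume shows up inside each pod

One caveat: the options key depends on the volume type. A hostPath volume takes options.path, but a persistentVolumeClaim takes options.claimName=<your-pvc-name> instead, as in the sketch further up.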

You can always verify that the data actually made it there by going inside the pods themselves, like so:

kubectl exec -it <driver/executor pod name> -- bash
# you should now be inside a bash session in the pod
cd <mount-path-you-chose>
ls -al

That last ls -al command should show you a people.json file in there (after having done your spark-submit of course).
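
Once the file is visible there, the read in your script just needs to point at the in-pod mount path. A sketch, keeping the master URL placeholder from your question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("spark://<master-ip>:<master-port>").getOrCreate()
# Use the mount path, which is identical on the driver and every executor.
df = spark.read.json("/sparkdata/people.json")
df.show()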

Hope this helps!

Upvotes: 1
