Reputation: 395
I am trying to run multiple Spark Structured Streaming jobs on EMR, each of which reads from a Kafka topic and writes to its own path in S3. I have configured the cluster to use the CapacityScheduler. Here is a snippet of the code that I am trying to run:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Kafka topic and cast key/value to strings
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "<BOOTSTRAP_SERVERS>") \
    .option("subscribePattern", "<MY_TOPIC>") \
    .load() \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Write the stream to S3 as JSON, checkpointing to S3 as well
output = df \
    .writeStream \
    .format("json") \
    .outputMode("update") \
    .option("checkpointLocation", "s3://<CHECKPOINT_LOCATION>") \
    .option("path", "s3://<SINK>") \
    .start() \
    .awaitTermination()
I tried running two jobs in parallel:
spark-submit --queue <QUEUE_1> --deploy-mode cluster --master yarn <STREAM_1_SCRIPT>.py
spark-submit --queue <QUEUE_2> --deploy-mode cluster --master yarn <STREAM_2_SCRIPT>.py
During execution, I noticed that the second job was not writing anything to S3 (even though the first job was). I also noticed, in the Spark UI, a huge spike in resource utilization for the second job.
After stopping the first job, the second job's data showed up in S3. Is it not possible to run two separate Spark Structured Streaming jobs that write to sinks (specifically on S3) in parallel? Does the write operation cause some kind of blocking?
Upvotes: 1
Views: 2923
Reputation: 310
Yes, you can! It is not something that is well documented in many places, but the only thing you need is to share the Spark session between the threads of your multiple jobs. I built a pipeline with multiple Spark Structured Streaming queries following this article: https://cm.engineering/multiple-spark-streaming-jobs-in-a-single-emr-cluster-ca86c28d1411. If you have any questions, you can send me an email or message me directly.
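To illustrate the idea, here is a minimal sketch (not the article's exact code) of running several streaming queries from one shared SparkSession inside a single application, so they don't compete as separate YARN apps. Topic names, bootstrap servers, checkpoint locations, and sink paths are placeholders. The article drives the queries from separate threads; since start() is non-blocking, this sketch simply starts them one after another from the driver, which amounts to the same thing.

from pyspark.sql import SparkSession

# One SparkSession shared by all streaming queries in this application
spark = SparkSession.builder.appName("multi-stream").getOrCreate()

def start_stream(topic, checkpoint, sink):
    # Define and start one independent streaming query against the shared session
    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "<BOOTSTRAP_SERVERS>") \
        .option("subscribePattern", topic) \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    return df \
        .writeStream \
        .format("json") \
        .outputMode("append") \
        .option("checkpointLocation", checkpoint) \
        .option("path", sink) \
        .start()  # non-blocking, unlike awaitTermination()

start_stream("<TOPIC_1>", "s3://<CHECKPOINT_1>", "s3://<SINK_1>")
start_stream("<TOPIC_2>", "s3://<CHECKPOINT_2>", "s3://<SINK_2>")

# Block the driver until any of the running queries terminates
spark.streams.awaitAnyTermination()

Note that you should not call awaitTermination() on the first query before starting the second one, since that call blocks; start all queries first and then wait on the StreamingQueryManager as above.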
Thank you!
Upvotes: 3