Reputation: 395
I am trying to run multiple Spark Structured Streaming jobs on EMR, each of which reads from a Kafka topic and writes to its own path in S3. I have configured the cluster to use the CapacityScheduler. Here is a snippet of the code that I am trying to run:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Kafka topic and cast key/value to strings
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "<BOOTSTRAP_SERVERS>") \
    .option("subscribePattern", "<MY_TOPIC>") \
    .load() \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Write the stream to S3 as JSON, checkpointing to S3 as well
output = df \
    .writeStream \
    .format("json") \
    .outputMode("update") \
    .option("checkpointLocation", "s3://<CHECKPOINT_LOCATION>") \
    .option("path", "s3://<SINK>") \
    .start() \
    .awaitTermination()
I tried running two jobs in parallel:
spark-submit --queue <QUEUE_1> --deploy-mode cluster --master yarn <STREAM_1_SCRIPT>.py
spark-submit --queue <QUEUE_2> --deploy-mode cluster --master yarn <STREAM_2_SCRIPT>.py
During execution, I noticed that the second job was not writing anything to S3 (even though the first job was). I also noticed, in the Spark UI, a huge spike in resource utilization for the second job.
After stopping the first job, the second job's data showed up in S3. Is it not possible to run two separate Spark Structured Streaming jobs that write to sinks (specifically on S3) in parallel? Does the write operation cause some kind of blocking?
Upvotes: 1
Views: 2923
Reputation: 310
Yes, you can! It is not something that is well documented in many places, but the only thing you need is to share the Spark session between the threads of your multiple jobs. I built a pipeline with multiple Spark Structured Streaming queries following this article: https://cm.engineering/multiple-spark-streaming-jobs-in-a-single-emr-cluster-ca86c28d1411. If you have any questions, you can send me an email or message me directly.
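To illustrate the idea, here is a minimal sketch (not the article's exact code) of running several streaming queries from one shared SparkSession inside a single application, so they don't compete as separate YARN apps. Topic names, bootstrap servers, checkpoint locations, and sink paths are placeholders. The article drives the queries from separate threads; since start() is non-blocking, this sketch simply starts them one after another from the driver, which amounts to the same thing.

from pyspark.sql import SparkSession

# One SparkSession shared by all streaming queries in this application
spark = SparkSession.builder.appName("multi-stream").getOrCreate()

def start_stream(topic, checkpoint, sink):
    # Define and start one independent streaming query against the shared session
    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "<BOOTSTRAP_SERVERS>") \
        .option("subscribePattern", topic) \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    return df \
        .writeStream \
        .format("json") \
        .outputMode("append") \
        .option("checkpointLocation", checkpoint) \
        .option("path", sink) \
        .start()  # non-blocking, unlike awaitTermination()

start_stream("<TOPIC_1>", "s3://<CHECKPOINT_1>", "s3://<SINK_1>")
start_stream("<TOPIC_2>", "s3://<CHECKPOINT_2>", "s3://<SINK_2>")

# Block the driver until any of the running queries terminates
spark.streams.awaitAnyTermination()

Note that you should not call awaitTermination() on the first query before starting the second one, since that call blocks; start all queries first and then wait on the StreamingQueryManager as above.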
Thank you!
Upvotes: 3