Reputation: 101
I have a Spark Streaming job that's reading from Kafka and saving the data into Redshift.
Each batch RDD contains data with a "group_id" column, but the code below isn't running the foreach concurrently; it runs serially in YARN client mode.
YARN environment:
inputDstream.foreachRDD { eventRdd: RDD[Event] =>
  ...
  // Convert eventRdd to eventDF
  val groupIds = eventDF.select("group_id").distinct.collect.flatMap(_.toSeq)

  groupIds.par.foreach { groupId =>
    val teventDF = eventDF.where($"group_id" <=> groupId)
    val teventDFWithVersion = teventDF.withColumn("schema_id", lit(version))
    teventDFWithVersion.write
      .format("io.github.spark_redshift_community.spark.redshift")
      .options(opts)
      .mode("Append")
      .save()
  }
}
Again, the operation inside groupIds.par.foreach runs serially instead of in parallel. As the number of groups increases, my application starts to choke and processing time spikes.
How do I get Spark to save my batches of data concurrently?
Upvotes: 0
Views: 139
Reputation: 101
The driver is running on an m5.large (2 vCPUs), but only 1 vCPU is available to the driver application since other services are occupying the other core.
array.par.foreach{} runs concurrently based on the number of vCPUs available to the driver. Running the driver with more CPUs allows more concurrent writes.
Solution: run the Spark driver application in client mode on a machine with more CPUs, or run the application in cluster mode with --driver-cores 4.
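If resizing the driver isn't an option, the parallelism of the Scala parallel collection can also be set explicitly instead of relying on the default pool, which is sized from the driver's available processors. A minimal sketch, assuming Scala 2.12 (where ForkJoinTaskSupport wraps a java.util.concurrent.ForkJoinPool); the pool size of 8 is just an illustrative value:

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

val parGroupIds = groupIds.par
// By default .par uses a pool sized from the driver JVM's available
// processors, which is why a 1-vCPU driver effectively serializes the writes.
// Give the collection an explicit pool so write concurrency no longer
// depends on the driver's core count (8 is an illustrative value).
parGroupIds.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
parGroupIds.foreach { groupId =>
  // same per-group write to Redshift as in the question
}

The writes are still submitted from the driver, so it needs enough cores and memory to keep the concurrent jobs scheduled, which is why adding driver cores helped in my case.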
Upvotes: 0