Jithesh Gopinathan

Reputation: 448

How do partitions work in Spark Streaming?

I am working on improving the performance of a Spark Streaming application.

How do partitions work in a streaming environment? Is it the same as loading a file into Spark, or does it always create only one partition, making it run on only one core of the executor?

Upvotes: 4

Views: 2390

Answers (1)

Michael Heil

Reputation: 18475

In Spark Streaming (not Structured Streaming) the partitioning works exactly as you know it from working with RDDs. You can easily check the number of partitions with

rdd.getNumPartitions

As you have also tagged spark-streaming-kafka, it is worth mentioning that the number of partitions in your input DStream will match the number of partitions in the Kafka topic you are consuming.
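As a minimal sketch of how you could observe this (the broker address localhost:9092, the topic name myTopic, and the group id are all placeholders), you can create a direct stream with the spark-streaming-kafka-0-10 API and print the partition count of each micro-batch:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("PartitionCheck")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092", // placeholder broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "partition-check" // placeholder consumer group
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("myTopic"), kafkaParams))

// One Spark partition per Kafka topic partition in every micro-batch
stream.foreachRDD(rdd => println(s"Partitions in this batch: ${rdd.getNumPartitions}"))

ssc.start()
ssc.awaitTermination()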

In general, for RDDs, the HashPartitioner and the RangePartitioner are available as partitioning strategies. You can apply the HashPartitioner with

rdd.partitionBy(new HashPartitioner(2))

where rdd is a key-value paired RDD and 2 is the number of partitions.
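As a small self-contained sketch (assuming sc is an existing SparkContext), you can verify how the keys get distributed across the two partitions:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)))
val partitioned = pairs.partitionBy(new HashPartitioner(2))

// Count the records that ended up in each partition
partitioned
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx: $n records") }

All pairs sharing the same key land in the same partition, because the target partition is derived from the key's hash code.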

Compared to the Structured API, RDDs also have the advantage of supporting custom partitioners. For this, you can extend the Partitioner class and override the methods numPartitions and getPartition, like in the example given below:

import org.apache.spark.Partitioner

class TablePartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = {
    val tableName = key.asInstanceOf[String]
    if (tableName == "foo") 0 // partition indices start at 0
    else 1
  }
}
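To use it, pass an instance to partitionBy just like a built-in partitioner (again assuming sc is an existing SparkContext; the table names here are made up):

val byTable = sc.parallelize(Seq(("foo", "row1"), ("bar", "row2"), ("foo", "row3")))
val routed = byTable.partitionBy(new TablePartitioner)

// Show which partition each record was routed to
routed
  .mapPartitionsWithIndex((idx, it) => it.map { case (k, v) => s"partition $idx: ($k, $v)" })
  .collect()
  .foreach(println)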

Upvotes: 4
