Reputation: 448
I am working on performance improvement of a Spark Streaming application.
How does partitioning work in a streaming environment? Is it the same as loading a file into Spark, or does it always create only one partition, making it work on only one core of the executor?
Upvotes: 4
Views: 2390
Reputation: 18475
In Spark Streaming (not Structured Streaming) the partitioning works exactly as you know it from working with RDDs. You can easily check the number of partitions with
rdd.getNumPartitions
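For example, in a streaming job you can log the partition count of each micro-batch from inside foreachRDD. A minimal sketch, assuming a socket source; the host, port, and batch interval are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("PartitionCheck").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// any DStream works here; a socket stream is just the simplest to set up
val lines = ssc.socketTextStream("localhost", 9999)

// each micro-batch is an RDD, so getNumPartitions applies directly
lines.foreachRDD(rdd => println(s"Partitions in this batch: ${rdd.getNumPartitions}"))

ssc.start()
ssc.awaitTermination()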
As you have also tagged spark-streaming-kafka, it is worth mentioning that with the direct stream approach the number of partitions in your input DStream will match the number of partitions in the Kafka topic you are consuming.
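For reference, this is roughly how such a direct stream is created with spark-streaming-kafka-0-10; a sketch only, where the topic name, group id, and bootstrap servers are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "partition-demo",
  "auto.offset.reset" -> "latest"
)

// the resulting DStream has one partition per Kafka topic partition
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
)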
In general, for RDDs, there are the HashPartitioner and the RangePartitioner available as repartitioning strategies. You could apply the HashPartitioner with
rdd.partitionBy(new HashPartitioner(2))
where rdd is a key-value paired RDD and 2 is the number of partitions.
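Put together, a minimal sketch (assuming sc is an existing SparkContext; the sample data is made up):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("foo", 1), ("bar", 2), ("foo", 3)))

// hash the keys into 2 partitions
val repartitioned = pairs.partitionBy(new HashPartitioner(2))
println(repartitioned.getNumPartitions) // prints 2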
Compared to the Structured API, RDDs also have the advantage that you can apply custom partitioners. For this, you can extend the Partitioner class and override the methods numPartitions and getPartition, like in the example given below:
import org.apache.spark.Partitioner

class TablePartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = {
    val tableName = key.asInstanceOf[String]
    if (tableName == "foo") 0 // partition indices start at 0
    else 1
  }
}
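Applying it looks the same as with the built-in partitioners; a sketch with made-up data, again assuming an existing sc:

// keys equal to "foo" land in partition 0, everything else in partition 1
val byTable = sc.parallelize(Seq(("foo", 1), ("bar", 2), ("baz", 3)))
  .partitionBy(new TablePartitioner)
println(byTable.getNumPartitions) // prints 2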
Upvotes: 4