kalyan chakravarthy

Reputation: 653

FIFO processing using Spark Streaming?

I have a use case where I have to process events in FIFO fashion. These are events generated by machines; each machine generates one event per 30 seconds. For a particular machine, the events need to be processed in FIFO order.

We need to process around 240 million events per day. For such a massive scale we need to use Kafka + Spark Streaming.
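For context, that daily volume works out to only a few thousand events per second:

```python
# 240 million events/day is only a few thousand events per second.
events_per_day = 240_000_000
events_per_sec = events_per_day / (24 * 60 * 60)  # ~2,778 events/sec
```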

From the Kafka documentation I understand that we can use the key field of a message to route the message to a particular topic partition. This means I can use the machine id as the key and ensure that all messages from a particular machine land in the same topic partition.
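A toy sketch of that keyed routing (the real Kafka producer murmur2-hashes the serialized key; crc32 is used here only to keep the sketch dependency-free, since the property that matters is that the hash is deterministic):

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for Kafka's default partitioner. The real producer
    # murmur2-hashes the serialized key; any deterministic hash shows
    # the same guarantee: a fixed key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# All events keyed by the same machine id land in one partition,
# so Kafka preserves their order relative to each other.
events = [("machine-42", t) for t in range(5)]
target_partitions = {partition_for(key, 8) for key, _ in events}
```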

50% of the problem is solved.

Here comes the question on the processing side.

The Spark documentation for the Kafka direct approach says that RDD partitions are equivalent to Kafka partitions.

So when I execute rdd.foreachPartition, does the task iterate over the records in order?

Is it ensured that a partition of an RDD always lies within one executor?

Is it ensured that a foreachPartition task is executed by only one thread for the entire partition?

Please help.

Upvotes: 1

Views: 517

Answers (2)

panther

Reputation: 767

From the Kafka documentation I understand that we can use the key field of a message to route the message to a particular topic partition. This means I can use the machine id as the key and ensure that all messages from a particular machine land in the same topic partition.

While publishing data to Kafka, you do not need to use the machine id as the key. With a null key, Kafka will internally distribute the records across partitions itself (round-robin for keyless records) to balance the load across the Kafka hosts.

Here comes the question on the processing side.

Gotcha: when you are processing in Spark, you are not going to have a global order. Example: there are 5 events (ordered by time): e0 (earliest), e1, e2, e3, e4 (latest).

These get routed to different Kafka partitions:

Kafka partition P0: e0, e3
Kafka partition P1: e1, e2, e4

So when you are reading in your Spark job, you will get e0, e3 in one RDD partition and e1, e2, e4 in another, each in that order.

If you want global ordering (e0, e1, e2, e3, e4), you will need to write to a single partition in Kafka. But then you lose the parallelism that partitioning provides and may run into performance issues (you will need to tune producers and consumers). ~3,000 events/sec should be fine, but that also depends on your Kafka cluster.
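A small simulation of the example above (plain Python, no Kafka involved) shows both halves of the claim: order survives within each partition, but not across them.

```python
# Five time-ordered events split across two partitions. Each
# partition preserves its own order, but a consumer that drains the
# partitions one after another does not see the global order.
events = ["e0", "e1", "e2", "e3", "e4"]       # global (time) order
p0, p1 = ["e0", "e3"], ["e1", "e2", "e4"]     # per-partition contents

def is_subsequence(sub, seq):
    it = iter(seq)
    return all(x in it for x in sub)

per_partition_ok = is_subsequence(p0, events) and is_subsequence(p1, events)
interleaved = p0 + p1                          # drain P0, then P1
```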

Your other questions have already been answered by @zsxwing (see his answer below).

Upvotes: 0

zsxwing

Reputation: 20826

Let's say you don't use any operators that repartition the data (e.g., repartition, reduceByKey, reduceByKeyAndWindow, ...).

So when I execute rdd.foreachPartition, does the task iterate over the records in order?

Yes. It processes the data following the order in the Kafka partition.
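A toy model of that behavior (not the real Spark API, just an illustration of the contract): the user function receives a plain iterator over one partition's records and consumes it sequentially.

```python
# Toy model of rdd.foreachPartition: the user function is handed an
# iterator over one partition's records and consumes it in order, so
# within a partition the Kafka offset order is preserved. In real
# Spark each partition runs as its own task, possibly in parallel, so
# no cross-partition order is implied.
def foreach_partition(partitions, f):
    for part in partitions:        # each partition would be one task
        f(iter(part))

seen = []
foreach_partition([[("m1", 0), ("m1", 1)], [("m2", 0)]],
                  lambda it: seen.extend(it))
```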

Is it ensured that a partition of an RDD always lies within one executor?

Yes. Only one task processes a partition if you don't enable speculation; speculation may launch another task to run the same partition if the first one is too slow.

Is it ensured that a foreachPartition task is executed by only one thread for the entire partition?

Yes. It processes the records in a partition one by one.

Upvotes: 1
