John Doe
John Doe

Reputation: 145

Design flaw in Spark Streaming App by using only one partition?

I am developing a streaming app with Apache Spark. The app receives sensor data by subscribing to a Kafka topic named sensor. The purpose of the app is to filter the sensor data, transform it and publish it back to a different Kafka topic named people for other consumers. The messages in topic people must have the same order as they arrived in topic sensor. Thus, I am currently using only one partition in Kafka.

Here's my code:

val myStream = KafkaUtils.createDirectStream[K, V](streamingContext, PreferConsistent, Subscribe[K, V](topics, consumerConfig))

def process(record: (RDD[ConsumerRecord[String, String]], Time)): Unit = record match {
case (rdd, time) if !rdd.isEmpty =>
    // More Code...
    // Filter RDD, transform to JSON, build Seq[People]...
    // In the end, I have: Dataset[People]
    // Publish to Kafka topic 'people'
case _ =>
}

myStream.foreachRDD((x, y) => process((x, y)))

Today, I asked a question on how to achieve the correct ordering in Spark, after transforming it into my People data structure.

The answer indicated that using Spark with a single partition is not wise and that this actually might be a design flaw:

Unless you have a single partition (and then you wouldn't use Spark, would you?) the order...

I am now wondering whether I can improve the overall design of my application (change the map-reduce flow) or if Spark is not a good fit for my use case.

Upvotes: 2

Views: 96

Answers (2)

wandermonk
wandermonk

Reputation: 7386

In your case Kafka is not the right choice. Kafka only maintains the total order over the messages within a partition. The parallelism or scalability of Kafka is purely depended on the no: of partitions on a particular topic. The flaw is completely with the design.

If you really want to preserve the order you can have a epoch timestamp in your data and once you transform the data you can sort the data and store it.

Upvotes: 0

user9660922
user9660922

Reputation: 11

While this is primarily opinion based you are using tools which are designed for:

  • fault tolerant,
  • distributed,
  • parallel,
  • processing, without specific order guarantees

to solve problem defined as:

  • sequential,
  • non-distributed,
  • with strict order guarantees,
  • possibly breaking fault tolerance (due to large amount of data placed on a single executor).

where:

  • single threaded consumer from fault tolerant queue

would be perfectly sufficient.

So subjectively speaking there is a serious design flaw here.

Upvotes: 1

Related Questions