Reputation: 6216
I am currently writing a Spark Streaming application. My task is pretty simple: receive JSON messages from Kafka and do some text filtering (each message must contain TEXT1, TEXT2, TEXT3, and TEXT4). The code looks like:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
messages.foreachRDD { rdd =>
  val originrdd = rdd.count()
  val record = rdd.map(_._2)
    .filter(x => x.contains(TEXT1))
    .filter(x => x.contains(TEXT2))
    .filter(x => x.contains(TEXT3))
    .filter(x => x.contains(TEXT4))
  val afterrdd = record.count()
  println(s"original number of records: $originrdd")
  println(s"number of records after filtering: $afterrdd")
}
Each JSON message is about 4 KB, and around 50,000 records arrive from Kafka every second (roughly 200 MB/s of input). The processing time for the above task is about 3 seconds per batch, so it cannot keep up in real time. I have a Storm topology for the same task, and it performs much faster.
Upvotes: 0
Views: 1327
Reputation: 191953
Well, you have made 3 unnecessary RDDs in this process: each chained filter produces a new RDD, and the four conditions can be combined into a single filter:
val record = rdd.map(_._2).filter(x => {
  x.contains(TEXT1) &&
  x.contains(TEXT2) &&
  x.contains(TEXT3) &&
  x.contains(TEXT4)
})
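If the list of keywords grows, a possible generalization (a sketch, not part of the original answer; keywords is a hypothetical name) is to keep them in one collection and test them with forall:

// Hypothetical variant: hold the keywords in a single collection
val keywords = Seq(TEXT1, TEXT2, TEXT3, TEXT4)
// Still one filter and one resulting RDD, however many keywords there are
val record = rdd.map(_._2).filter(msg => keywords.forall(msg.contains))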
Also worth reading: https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
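From that guide, two settings that may help a Kafka direct stream keep its batch time under the batch interval are backpressure and a per-partition rate cap. A minimal sketch, assuming a 1-second batch interval; the app name and the rate value are illustrative, not taken from your setup:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("KafkaTextFilter") // illustrative name
  // Let Spark adapt the ingestion rate to the observed processing speed
  .set("spark.streaming.backpressure.enabled", "true")
  // Cap records read per Kafka partition per second (illustrative value)
  .set("spark.streaming.kafka.maxRatePerPartition", "12500")
val ssc = new StreamingContext(conf, Seconds(1))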
Upvotes: 1