lserlohn

Reputation: 6216

Why does my Spark Streaming program process so slowly?

I am currently writing a Spark Streaming job. My task is pretty simple: just receive JSON messages from Kafka and do some text filtering (contains TEXT1, TEXT2, TEXT3, and TEXT4). The code looks like:

    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    messages.foreachRDD { rdd =>

      val originrdd = rdd.count()

      val record = rdd.map(_._2)
        .filter(x => x.contains(TEXT1))
        .filter(x => x.contains(TEXT2))
        .filter(x => x.contains(TEXT3))
        .filter(x => x.contains(TEXT4))

      val afterrdd = record.count()

      println("original number of records: " + originrdd)
      println("after filtering number of records: " + afterrdd)
    }

Each JSON message is about 4 KB, and around 50,000 records arrive from Kafka every second.

The processing time for the above task is about 3 seconds per batch, so it cannot achieve real-time performance. I have a Storm topology for the same task, and it performs much faster.

Upvotes: 0

Views: 1327

Answers (1)

OneCricketeer

Reputation: 191953

Well, you have made 3 unnecessary RDDs in this process. You can collapse the four chained filters into a single one with a combined predicate:

    val record = rdd.map(_._2).filter(x =>
      x.contains(TEXT1) &&
      x.contains(TEXT2) &&
      x.contains(TEXT3) &&
      x.contains(TEXT4)
    )

Also worth reading: https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
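For intuition, here is a minimal local sketch (plain Scala collections, no Spark required) showing that one combined predicate produces exactly the same result as the four chained filters. The sample messages and the TEXT1..TEXT4 values are made up for illustration; on real RDDs the chained version additionally creates three intermediate RDD objects in the lineage:

```scala
object FilterSketch {
  // Placeholder search strings standing in for the question's TEXT1..TEXT4.
  val TEXT1 = "alpha"; val TEXT2 = "beta"; val TEXT3 = "gamma"; val TEXT4 = "delta"

  // Hypothetical sample messages; only the first contains all four strings.
  val messages = Seq(
    """{"body":"alpha beta gamma delta"}""",
    """{"body":"alpha beta"}""",
    """{"body":"gamma delta"}"""
  )

  // Chained style: each filter builds a new intermediate collection.
  def chained: Seq[String] =
    messages.filter(_.contains(TEXT1))
            .filter(_.contains(TEXT2))
            .filter(_.contains(TEXT3))
            .filter(_.contains(TEXT4))

  // Combined style: one pass, one predicate, one result collection.
  def combined: Seq[String] =
    messages.filter(x =>
      x.contains(TEXT1) && x.contains(TEXT2) &&
      x.contains(TEXT3) && x.contains(TEXT4)
    )

  def main(args: Array[String]): Unit = {
    // Both styles keep the same records; only the number of
    // intermediate collections (or RDDs, on Spark) differs.
    assert(chained == combined)
    println(combined.size)
  }
}
```

Note that Spark pipelines narrow transformations like `filter` within a stage, so the combined predicate mainly saves per-element closure-call overhead and lineage bookkeeping rather than extra passes over the data.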

Upvotes: 1
