Reputation: 6216
I am currently writing a Spark Streaming application. My task is pretty simple: receive JSON messages from Kafka and do some text filtering (each message must contain TEXT1, TEXT2, TEXT3, and TEXT4). The code looks like:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
messages.foreachRDD { rdd =>
  val originrdd = rdd.count()
  val record = rdd.map(_._2)
    .filter(x => x.contains(TEXT1))
    .filter(x => x.contains(TEXT2))
    .filter(x => x.contains(TEXT3))
    .filter(x => x.contains(TEXT4))
  val afterrdd = record.count()
  println(s"original number of records: $originrdd")
  println(s"number of records after filtering: $afterrdd")
}
Each JSON message is about 4 KB, and around 50,000 records arrive from Kafka every second (roughly 200 MB/s of input). The processing time for the above task is about 3 seconds per batch, so it cannot keep up in real time. I have a Storm topology for the same task, and it performs much faster.
Upvotes: 0
Views: 1327
Reputation: 191953
Well, you have made 3 unnecessary RDDs in this process: each chained filter produces a new RDD, and the four conditions can be combined into a single filter:
val record = rdd.map(_._2).filter(x => {
  x.contains(TEXT1) &&
  x.contains(TEXT2) &&
  x.contains(TEXT3) &&
  x.contains(TEXT4)
})
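If the list of keywords grows, a possible generalization (a sketch, not part of the original answer; keywords is a hypothetical name) is to keep them in one collection and test them with forall:

// Hypothetical variant: hold the keywords in a single collection
val keywords = Seq(TEXT1, TEXT2, TEXT3, TEXT4)
// Still one filter and one resulting RDD, however many keywords there are
val record = rdd.map(_._2).filter(msg => keywords.forall(msg.contains))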
Also worth reading: https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
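From that guide, two settings that may help a Kafka direct stream keep its batch time under the batch interval are backpressure and a per-partition rate cap. A minimal sketch, assuming a 1-second batch interval; the app name and the rate value are illustrative, not taken from your setup:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("KafkaTextFilter") // illustrative name
  // Let Spark adapt the ingestion rate to the observed processing speed
  .set("spark.streaming.backpressure.enabled", "true")
  // Cap records read per Kafka partition per second (illustrative value)
  .set("spark.streaming.kafka.maxRatePerPartition", "12500")
val ssc = new StreamingContext(conf, Seconds(1))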
Upvotes: 1