Reputation: 1647
I am facing a weird issue here: I am reading Avro
records from Kafka, trying to deserialize them, and store them in a file. I am able to get the records from Kafka, but somehow, when I try to apply a function to the RDD records, it refuses to do anything.
import java.nio.ByteBuffer
import java.util.UUID

import io.confluent.kafka.serializers.KafkaAvroDecoder
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import com.my.project.avro.AvroDeserializer
import com.my.project.util.SparkJobLogging
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext, Time}
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.dstream.DStream
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.{SparkConf, SparkContext}

object KafkaConsumer extends SparkJobLogging {

  var schemaRegistry: SchemaRegistryClient = null
  val url = "url:8181"
  schemaRegistry = new CachedSchemaRegistryClient(url, 1000)

  def createKafkaStream(ssc: StreamingContext): DStream[(String, Array[Byte])] = {
    val kafkaParams = Map[String, String](
      "zookeeper.connect" -> "zk.server:2181",
      "group.id" -> s"${UUID.randomUUID().toString}",
      "auto.offset.reset" -> "smallest",
      "bootstrap.servers" -> "bootstrap.server:9092",
      "zookeeper.connection.timeout.ms" -> "6000",
      "schema.registry.url" -> "registry.url:8181"
    )
    val topic = "my.topic"
    KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, Set(topic))
  }

  def processRecord(avroStream: Array[Byte]) = {
    println(AvroDeserializer.toRecord(avroStream, schemaRegistry))
  }

  def main(args: Array[String]) = {
    val sparkConf = new SparkConf().setAppName("AvroDeserilizer")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(5))

    val topicStream = createKafkaStream(ssc).map(_._2)
    topicStream.foreachRDD(
      rdd => if (!rdd.isEmpty()) {
        logger.info(rdd.count())
        rdd.foreach(avroRecords => processRecord(avroRecords))
      }
    )

    ssc.start()
    ssc.awaitTermination()
  }
}

object AvroDeserializer extends SparkJobLogging {

  def toRecord(buffer: Array[Byte], registry: SchemaRegistryClient): GenericRecord = {
    val bb = ByteBuffer.wrap(buffer)
    bb.get()                                 // consume MAGIC_BYTE
    val schemaId = bb.getInt                 // consume schemaId
    val schema = registry.getByID(schemaId)  // consult the Schema Registry
    val reader = new GenericDatumReader[GenericRecord](schema)
    val decoder = DecoderFactory.get().binaryDecoder(buffer, bb.position(), bb.remaining(), null)
    reader.read(null, decoder)               // null -> as we are not providing any datum
  }
}
Up to the statement logger.info(rdd.count())
everything works fine and I see the exact record counts in the log. However, after that nothing works.
When I tried
val record = rdd.first()
processRecord(record)
it worked, but rdd.foreach(avroRecords => processRecord(avroRecords))
and rdd.map(avroRecords => processRecord(avroRecords))
don't work. It just prints the following on every streaming batch:
17/05/14 01:01:24 INFO scheduler.DAGScheduler: Job 2 finished: foreach at KafkaConsumer.scala:56, took 42.684999 s
17/05/14 01:01:24 INFO scheduler.JobScheduler: Finished job streaming job 1494738000000 ms.0 from job set of time 1494738000000 ms
17/05/14 01:01:24 INFO scheduler.JobScheduler: Total delay: 84.888 s for time 1494738000000 ms (execution: 84.719 s)
17/05/14 01:01:24 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer()
17/05/14 01:01:24 INFO scheduler.InputInfoTracker: remove old batch metadata:
17/05/14 01:01:26 INFO yarn.YarnAllocator: Canceling requests for 0 executor containers
17/05/14 01:01:26 WARN yarn.YarnAllocator: Expected to find pending requests, but found none.
17/05/14 01:01:29 INFO yarn.YarnAllocator: Canceling requests for 0 executor containers
17/05/14 01:01:29 WARN yarn.YarnAllocator: Expected to find pending requests, but found none.
It just keeps printing the last two lines in the log until the next streaming batch is triggered.
Upvotes: 1
Views: 1377
Reputation: 1647
Although the above method didn't work for me, I found a different way in the Confluent documentation. The KafkaAvroDecoder
communicates with the schema registry, fetches the schema, and deserializes the data, so it removes the need for a custom deserializer.
import io.confluent.kafka.serializers.KafkaAvroDecoder

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> brokers,
  "schema.registry.url" -> schemaRegistry,
  "key.converter.schema.registry.url" -> schemaRegistry,
  "value.converter.schema.registry.url" -> schemaRegistry,
  "auto.offset.reset" -> "smallest"
)
val topicSet = Set(topics)

val messages = KafkaUtils.createDirectStream[Object, Object, KafkaAvroDecoder, KafkaAvroDecoder](ssc, kafkaParams, topicSet).map(_._2)

messages.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    logger.info(rdd.count())
    rdd.saveAsTextFile("/data/")
  }
}

ssc.start()
ssc.awaitTermination()
Dependency jar: kafka-avro-serializer-3.1.1.jar. This is working perfectly for me right now, and I hope it will be helpful to someone in the future.
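If you are using sbt, a minimal sketch for pulling in that jar could look like the following. The coordinates and repository are my assumption based on the jar name; match the version to your Confluent installation.

// build.sbt (sketch) -- assumed Confluent coordinates, adjust the version as needed
resolvers += "Confluent" at "https://packages.confluent.io/maven/"
libraryDependencies += "io.confluent" % "kafka-avro-serializer" % "3.1.1"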
Upvotes: 0
Reputation: 73
val topicStream = createKafkaStream(ssc).map(_._2)
topicStream.foreachRDD(
  rdd => if (!rdd.isEmpty()) {
    logger.info(rdd.count())
    rdd.foreach(avroRecords => processRecord(avroRecords))
dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently. Some of the common mistakes to avoid are as follows.
DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions. Specifically, RDD actions inside the DStream output operations force the processing of the received data. Hence, if your application does not have any output operation, or has output operations like dstream.foreachRDD() without any RDD action inside them, then nothing will get executed. The system will simply receive the data and discard it.
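As a minimal sketch of that point (reusing the topicStream and logger names from the question; this is not the poster's exact code): a transformation alone inside foreachRDD never runs, while an RDD action forces execution.

// Transformation only: the map is lazily defined and never executes
topicStream.foreachRDD { rdd =>
  rdd.map(bytes => bytes.length)
}

// With an action inside: sum() forces the map to actually run
topicStream.foreachRDD { rdd =>
  val lengths = rdd.map(bytes => bytes.length)
  logger.info(s"sum of record sizes in this batch: ${lengths.sum()}")
}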
Upvotes: 1
Reputation: 4296
Your println
statements are being run on the distributed workers, not in the driver process, so you don't see them. You could try replacing println
with log.info
to verify this.
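One quick way to see output on the driver is to pull a small sample back before printing, roughly like this (a sketch reusing topicStream, AvroDeserializer and schemaRegistry from the question):

// take() is an action that ships a few records to the driver,
// so the println output appears in the driver log instead of on the executors
topicStream.foreachRDD { rdd =>
  rdd.take(5).foreach { bytes =>
    println(AvroDeserializer.toRecord(bytes, schemaRegistry))
  }
}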
Ideally you should be turning your DStream[Array[Byte]]
into a DStream[GenericRecord]
and writing that to a file, using .saveAsTextFiles
or something similar. You may want a stream.take()
in there because the stream could be infinite.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
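Something along these lines could work as a starting point (a sketch reusing the question's topicStream, AvroDeserializer and schemaRegistry; the output path prefix is just a placeholder):

// Decode on the executors, then let Spark Streaming write each batch out as text files.
val recordStream: DStream[GenericRecord] =
  topicStream.map(bytes => AvroDeserializer.toRecord(bytes, schemaRegistry))

// saveAsTextFiles is a DStream output operation, so it triggers execution on its own
recordStream.map(_.toString).saveAsTextFiles("/tmp/avro-records")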
Upvotes: 1