Kafka-Spark Streaming processing jobs synchronously

Question

I'm running a simple test that uses Kafka Connect and Spark.

I wrote a custom Kafka Connect connector that creates this source record:

SourceRecord sr = new SourceRecord(null,
        null,
        destTopic,
        Schema.STRING_SCHEMA,
        cleanPath);

In Spark I receive the message like this:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaConsumerParams = Map[String, String](
  "metadata.broker.list" -> prop.getProperty("kafka_host"),
  "zookeeper.connect" -> prop.getProperty("zookeeper_host"),
  "group.id" -> prop.getProperty("kafka_group_id"),
  "schema.registry.url" -> prop.getProperty("schema_registry_url"),
  "auto.offset.reset" -> prop.getProperty("auto_offset_reset")
)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConsumerParams, topicsSet)

messages.foreachRDD(rdd => {
  val toPrint = rdd.map(t => {
    val file_path = t._2 // message value (the path sent by the connector)
    val startTime = DateTime.now()

    Thread.sleep(1000 * 60) // simulate one minute of work per message
    1
  }).sum()
  LogUtils.getLogger(classOf[DeviceManager]).info(" toPrint = " + toPrint + " (number of flows calculated)")
})

When I use the connector to send multiple messages to the target topic (which has 6 partitions in my test), the sleeping task receives all the messages but processes them synchronously, one after another, instead of asynchronously.

When I send the messages from a simple test producer instead, the sleeps run asynchronously.

I also created 2 simple consumers and tried both the connector and the producer, and in both cases the tasks were consumed asynchronously. That means my problem lies in the way Spark receives the messages sent from the connector. I can't figure out why the tasks don't behave the same way as they do when I send the messages from a producer.

I even printed the records Spark receives, and they look exactly the same.

Records sent by the producer:

1: {partition=2, offset=11, value=something, key=null}
2: {partition=5, offset=9, value=something2, key=null}

Record sent by the connector:

1: {partition=3, offset=9, value=something, key=null}

The versions used in my project are:

dependencies

 
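        <dependency>
            <groupId>io.confluent</groupId>
            <artifactId>kafka-avro-serializer</artifactId>
            <version>${confluent.version}</version>
        </dependency>
        <dependency>
            <groupId>io.confluent</groupId>
            <artifactId>kafka-schema-registry-client</artifactId>
            <version>${confluent.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.8.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.11</artifactId>
            <version>1.6.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-graphx_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.datastax.spark</groupId>
            <artifactId>spark-cassandra-connector_2.11</artifactId>
            <version>2.0.0-RC1</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.8.0</version>
        </dependency>

        <dependency>
            <groupId>io.confluent</groupId>
            <artifactId>kafka-avro-serializer</artifactId>
            <version>${confluent.version}</version>
            <scope>${global.scope}</scope>
        </dependency>
        <dependency>
            <groupId>io.confluent</groupId>
            <artifactId>kafka-connect-avro-converter</artifactId>
            <version>${confluent.version}</version>
            <scope>${global.scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>connect-api</artifactId>
            <version>${kafka.version}</version>
        </dependency>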

himanshuIIITian · Accepted Answer

We cannot run Spark-Kafka streaming jobs asynchronously, but we can run them in parallel, as Kafka consumers do. To do that, set the following configuration on the SparkConf:

sparkConf.set("spark.streaming.concurrentJobs","4")

By default its value is "1", but we can override it with a higher value.
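To see why a value of 1 makes the batch jobs run one after another: as far as I understand, Spark Streaming hands the jobs of each batch to a fixed-size thread pool whose size is this setting. The sketch below is only a plain-JVM analogy of that behaviour, not Spark code. The class name `ConcurrentJobsDemo`, the method `peakConcurrency`, and the sleep durations are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrentJobsDemo {

    // Runs `jobs` sleeping tasks on a fixed pool of `poolSize` threads and
    // returns the highest number of tasks observed running at the same time.
    // `poolSize` plays the role of spark.streaming.concurrentJobs (assumption:
    // this is an analogy, not the actual Spark scheduler).
    public static int peakConcurrency(int poolSize, int jobs, long sleepMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < jobs; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(sleepMillis); // stands in for a slow batch job
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for every job to finish
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("pool=1 peak=" + peakConcurrency(1, 4, 200));
        System.out.println("pool=4 peak=" + peakConcurrency(4, 4, 200));
    }
}
```

With a pool of one thread the peak can never exceed 1, so the jobs run strictly in sequence. With a pool of four, several jobs normally overlap, which is the effect a higher "spark.streaming.concurrentJobs" is meant to give.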

I hope this helps!
