ggagliano

Reputation: 1084

Spark Streaming in Standalone Cluster takes the same Kafka message more than once

My Spark Streaming application reads every record exactly once when I run it locally, but when I deploy it on a standalone cluster it reads the same message from Kafka twice. I have also double-checked that this is not a problem with the Kafka producer.

This is how I create the stream:

val stream = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent,
  Subscribe[String, String](Array("aTopic"), kafkaParams))

This is the kafkaParams configuration:

"bootstrap.servers" -> KAFKA_SERVER,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "test-group",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)

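As far as I understand, with enable.auto.commit set to false the spark-streaming-kafka-0-10 integration leaves committing offsets to the application, roughly like this (just a sketch of the documented pattern, not code I am actually running yet):

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // every micro-batch RDD produced by the direct stream carries its Kafka offset ranges
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  // commit the offsets back to Kafka only after the batch has been handled
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
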
The cluster has 2 workers with one executor per worker, and it looks like each executor takes the same message. Can anybody help me, please?

EDIT

As an example, when I send a single point to Kafka, from this code:

    stream.foreachRDD((rdd, time) => {
      if (!rdd.isEmpty) {
        log.info("Taken " + rdd.count() + " points")
      }
    })

I obtain "Taken 2 points". If I print them, they are equal. Am I doing something wrong?
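
To see where the two identical points come from, I could also log each record's partition and offset; a minimal sketch of that check (mapping to plain fields before collecting, since ConsumerRecord itself may not be serializable; log is the same application logger as above):

stream.foreachRDD { rdd =>
  // extract plain fields on the executors so only serializable tuples are collected
  rdd.map(r => (r.partition(), r.offset(), r.value()))
     .collect()
     .foreach { case (partition, offset, value) =>
       log.info(s"partition=$partition offset=$offset value=$value")
     }
}

If both points show the same partition and offset, the same Kafka record is being consumed twice; if the offsets differ, the message really was written to the topic twice.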

I'm using

Upvotes: 0

Views: 142

Answers (0)
