Sascha Vetter

Reputation: 2506

Kafka Connect HDFS ignores flush.size in Confluent v4.0

Since migrating to Confluent v4.0, flush.size for kafka-connect-hdfs no longer works for me. It worked with Confluent v3.x.

This is the current configuration file:

name=my-job-name
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1

topics=my-topic-name

hdfs.url=hdfs://my/path
hadoop.conf.dir=/etc/hadoop/conf/
flush.size=50000
#rotate.interval.ms=80000

When I start the job, it generates millions of small Avro files in HDFS.

-rw-r--r--   ...     43.8 K 2018-01-29 13:26 /my/path/topics/my-topic-name/partition=5/my-topic-name+5+0000000000+0000000143.avro
-rw-r--r--   ...      3.7 K 2018-01-29 13:26 /my/path/topics/my-topic-name/partition=5/my-topic-name+5+0000000144+0000000149.avro
...

As you can tell from the offsets in the file names, some of the files contain only 6 events. What am I missing? Why am I seeing this new behavior?

Upvotes: 1

Views: 547

Answers (1)

OneCricketeer

Reputation: 191743

The HDFS connector rolls a new file whenever the schema of the incoming messages changes, so a topic with frequently changing schemas will produce many small files regardless of flush.size.
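If the schema changes are backward compatible, the connector can be told to project records to the latest schema instead of rolling a new file on every change. A minimal sketch, assuming the HDFS connector's schema.compatibility setting (its default, NONE, rotates the output file on any schema change):

# added to the connector properties above; BACKWARD projects older
# records to the newest schema instead of rotating the output file
schema.compatibility=BACKWARD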

To inspect the message schemas (if using Avro), you can either take an offset from a file name, consume that message directly from Kafka to get its schema ID, and then hit the Schema Registry at GET /schemas/ids/{id}.
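A minimal sketch of the Schema Registry call, assuming the registry runs at localhost:8081 and the message's schema ID turned out to be 21 (both hypothetical values):

# fetch the schema registered under ID 21
curl -s http://localhost:8081/schemas/ids/21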

Or download the files from HDFS and run the avro-tools getschema command on them.
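A sketch of that approach, reusing the first file from the listing above (the avro-tools jar version is a placeholder):

# copy one of the small files out of HDFS
hdfs dfs -get /my/path/topics/my-topic-name/partition=5/my-topic-name+5+0000000000+0000000143.avro .

# print the writer schema embedded in the Avro container file
java -jar avro-tools-1.8.2.jar getschema my-topic-name+5+0000000000+0000000143.avro

If the schemas printed for two adjacent files differ, that confirms the rotation is schema-driven rather than flush.size being ignored.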

Upvotes: 1
