Reputation: 2506
With the migration to Confluent v4.0, the flush.size
for kafka-connect-hdfs doesn't work for me anymore. It worked with Confluent v3.x.
This is the current configuration file:
name=my-job-name
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=my-topic-name
hdfs.url=hdfs://my/path
hadoop.conf.dir=/etc/hadoop/conf/
flush.size=50000
#rotate.interval.ms=80000
When I start the job, it generates millions of small Avro files in HDFS.
-rw-r--r-- ... 43.8 K 2018-01-29 13:26 /my/path/topics/my-topic-name/partition=5/my-topic-name+5+0000000000+0000000143.avro
-rw-r--r-- ... 3.7 K 2018-01-29 13:26 /my/path/topics/my-topic-name/partition=5/my-topic-name+5+0000000144+0000000149.avro
...
As you can tell from the offsets, some of the files contain only 6 events. What am I missing? Why do I see this new behavior?
Upvotes: 1
Views: 547
Reputation: 191743
Files get rotated like this when the schema of the messages changes: a schema change forces the connector to commit the current file and start a new one, regardless of flush.size.
To inspect the message schemas (if you are using Avro), you can take the offset numbers from a file name, consume those messages directly from Kafka to get their schema IDs, and query the Schema Registry with GET /schemas/ids/(schema-id).
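For example, assuming the Schema Registry is running on its default port and using a hypothetical schema ID of 21:
curl http://localhost:8081/schemas/ids/21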
Alternatively, download the files from HDFS and run the avro-tools getschema command on them.
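For example, using one of the file names from the listing above (and assuming the avro-tools jar is available locally):
hadoop fs -copyToLocal /my/path/topics/my-topic-name/partition=5/my-topic-name+5+0000000000+0000000143.avro .
java -jar avro-tools.jar getschema my-topic-name+5+0000000000+0000000143.avro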
Upvotes: 1