Multiple Kafka Connect instances using the S3 sink with the same `topics.dir`

Question

I have multiple independent Kafka clusters, each with topicA and topicB. Each cluster has a Kafka Connect instance running the S3 sink. Apart from the AWS region they run in, the setups are identical (message volumes differ, but that isn't relevant here).

Right now, I have `topics.dir` set to `s3://bucketname/{region}/{connector-name}`. However, I would like all the topicA data (from all regions) to share one common prefix and all the topicB data (from all regions) to share a different common prefix. I don't strictly need a separate prefix per region, but I assume that having multiple Kafka Connect instances write to the same location is a recipe for disaster if two of them ever try to write to the same S3 path.

It occurred to me that I could tweak the `TimeBasedPartitioner` to write to `{region}/yyyy/mm/dd` and change `topics.dir` to `s3://bucketname/{connector-name}`, so that my data would land in `s3://bucketname/{connector-name}/{topic}/{region}/yyyy/mm/dd`. Would this approach work? I don't see why not, though I have yet to try it. (The only obstacle I can imagine is the S3 sink requiring that `topics.dir` not exist when the connector first starts, but that isn't documented anywhere, and S3 isn't a file system anyway.)
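To make the proposed layout concrete, here is a minimal Python sketch of how the object keys would be composed under this scheme. It is only an illustration, not the connector's code: the helper `object_key`, the connector name `my-connector`, the region names, and the `.json` extension are all hypothetical, and the `{topic}+{partition}+{startOffset}` file name is my assumption about how the sink names its objects.

```python
from datetime import datetime, timezone


def object_key(topics_dir, topic, region, ts, kafka_partition, start_offset, ext="json"):
    """Compose an S3 object key for the proposed layout (hypothetical helper).

    Assumed layout: {topics_dir}/{topic}/{region}/yyyy/mm/dd/{file_name}
    Assumed file name: {topic}+{partition}+{zero-padded start offset}.{ext}
    """
    date_path = ts.strftime("%Y/%m/%d")
    # The region is placed in front of the date-based partition path.
    encoded_partition = f"{region}/{date_path}"
    file_name = f"{topic}+{kafka_partition}+{start_offset:010d}.{ext}"
    return f"{topics_dir}/{topic}/{encoded_partition}/{file_name}"


# Same topic, partition, offset, and day, written from two regions.
ts = datetime(2024, 1, 15, tzinfo=timezone.utc)
key_east = object_key("my-connector", "topicA", "us-east-1", ts, 0, 42)
key_west = object_key("my-connector", "topicA", "us-west-2", ts, 0, 42)

print(key_east)
print(key_west)
```

If those assumptions hold, two regions can never produce the same key even when topic, partition, and offset coincide, because the region segment always differs, while everything for topicA still sits under the shared `my-connector/topicA/` prefix.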


Answers (0)
