Reputation: 481
I'm backing up my kafka topics to s3 using confluent's kafka-connect-s3 https://www.confluent.io/hub/confluentinc/kafka-connect-s3. I want to be able to easily query this data using Athena and have it properly partitioned for cheap/fast reads.
I want to partition by the (year/month/day/topic) tuple. The year/month/day part is already solved by using the Daily partitioner https://docs.confluent.io/kafka-connect-s3-sink/current/index.html#partitioning-records-into-s3-objects, so year=YYYY/month=MM/day=DD is worked into the path and any Hive-based querying is partitioned on time. See the MSCK REPAIR TABLE explanation and note its example using userid= in the path:
https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html
However, based on these docs https://docs.confluent.io/kafka-connect-s3-sink/current/index.html#s3-object-names I get {topic} in the path, but there's no way to modify it to topic={topic}. I could work this into the prefix (instead of env={env} the prefix would be env={env}/topic={topic}), but that seems redundant with another only-child {topic} directory underneath it.
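For reference, a minimal sink config along these lines is sketched below (connector name, bucket, region, topic, and the env=prod prefix are placeholders based on the description above; the properties are the documented kafka-connect-s3 ones, but verify against the connector version you run):

```json
{
  "name": "s3-backup-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "1",
    "topics": "my-topic",
    "s3.bucket.name": "my-backup-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000",
    "topics.dir": "env=prod",
    "partitioner.class": "io.confluent.connect.storage.partitioner.DailyPartitioner",
    "locale": "en-US",
    "timezone": "UTC"
  }
}
```

With that, objects land at keys like env=prod/my-topic/year=2023/month=01/day=15/my-topic+0+0000000000.json, i.e. the {topic} segment is there but can't be renamed to topic={topic} through configuration alone.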
I also noticed the topic name is in the object (file) name, delimited by + along with the partition and starting offset.
My question: how can I get topic={topic} in my path so Hive-based queries automatically create that partition? Or do I already get that for free by having it in the path (with no topic=) or in the object name (again, with no topic=)?
Upvotes: 0
Views: 746
Reputation: 191743
how can I get topic={topic} in my path so hive-based queries automatically create that partition?
There isn't, out of the box. You'd have to override the Partitioner class of the connector. Otherwise, the input file name is, I think, exposed as a special column (INPUT__FILE__NAME in Hive, "$path" in Athena) that can be queried and parsed, since the files have topic names in them.
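To illustrate the first option, a custom partitioner override might look roughly like this. This is a sketch, not the connector's own code: the class and package names are hypothetical, it assumes the DailyPartitioner/Partitioner classes from kafka-connect-storage-common, and the directory delimiter is hard-coded instead of read from the connector config, so check the signatures against the version you run:

```java
package com.example.connect; // hypothetical package

import io.confluent.connect.storage.partitioner.DailyPartitioner;

/**
 * Sketch: keep the daily year=/month=/day= encoding from DailyPartitioner,
 * but write the topic directory as a Hive-style partition key (topic=...).
 */
public class HiveTopicDailyPartitioner<T> extends DailyPartitioner<T> {

  @Override
  public String generatePartitionedPath(String topic, String encodedPartition) {
    // Default partitioners return roughly "<topic>/<encodedPartition>".
    // Prefix the topic segment so Hive/Athena can treat it as a partition column.
    // A production version should use the configured directory delimiter
    // rather than a hard-coded "/".
    return "topic=" + topic + "/" + encodedPartition;
  }
}
```

Package the class as a JAR on the connector's plugin path and point partitioner.class at it in the sink config.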
The recommendation would be to make a partitioned table per topic rather than having the topic be a partition itself.
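To sketch that recommendation: with the daily partitioner layout already in place, a per-topic table could look like the following (table, column, bucket, and prefix names are placeholders, and the JSON SerDe is just one option depending on the sink's format.class):

```sql
-- Hypothetical per-topic table over s3://my-backup-bucket/env=prod/my-topic/...
-- Columns must match the fields of the records the sink actually writes.
CREATE EXTERNAL TABLE IF NOT EXISTS my_topic_backup (
  id string,
  payload string
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-backup-bucket/env=prod/my-topic/';

-- Register the existing year=/month=/day= directories as partitions.
MSCK REPAIR TABLE my_topic_backup;

-- The object key (which contains the topic name) is still reachable via "$path".
SELECT "$path", id
FROM my_topic_backup
WHERE year = '2023' AND month = '01' AND day = '15'
LIMIT 10;
```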
Upvotes: 1