rombez
rombez

Reputation: 2198

Proper way to handle not needed/old/stale AWS Athena partitions

We have encountered a situation where we need to figure out how to handle stale AWS Athena partitions due to Athena's service limits (20,000 partitions per table).

Say we want to have only one table and add a number of partitions there every day that reference a timestamp-like path to some logs located on S3 (example: /foo_bucket/logs/year=2019/month=03/day=11/hour=20). We found out that this approach lets us reach the partition limit in about 2 years. After that, we want to clean up old logs and partitions that are associated with them.

Questions:

  1. What will happen with partition metadata if an associated S3 path gets removed? This S3 bucket automatically removes old objects. Will the partition automatically be deleted as well or will it reference a non-existent S3 path?
  2. What will happen when we reach partition limit per table before S3 automatically deletes old objects? Will old partition metadata get deleted by Athena? I know that deleting a partition does not touch S3 object data (link).

Thanks!

Upvotes: 2

Views: 1591

Answers (1)

Roberto
Roberto

Reputation: 847

Question 1:

As the metadata and data are totally unlinked in AWS Athena, you will need to call the command that basically says "hey, refresh my table partitions list" wich is MSCK REPAIR TABLE https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html

Question 2:

I think you will get and error when Athena tries to add a new partition in the metastore as the old partition metadata is not removed.

Feedback:

Maybe it's not a bad idea to consolidate old content into less granular partitions. For example, you can keep a day granularity for the last 3 - 4 months. Anything older than that can be stored per month. That will allow you to keep much more content in your table.

I hope it helps :)

Upvotes: 1

Related Questions