Reputation: 192
I have an athena table with partition based on date like this:
20190218
I want to delete all the partitions that are created last year.
I tried the below query, but it didnt work.
ALTER TABLE tblname DROP PARTITION (partition1 < '20181231');
ALTER TABLE tblname DROP PARTITION (partition1 > '20181010'), Partition (partition1 < '20181231');
Upvotes: 6
Views: 16158
Reputation: 8937
Posting the Glue API workaround for Java to save some time for these who need it:
public void deleteMetadataTablePartition(String catalog,
String db,
String table,
String expression) {
GetPartitionsRequest getPartitionsRequest = new GetPartitionsRequest()
.withCatalogId(catalog)
.withDatabaseName(db)
.withTableName(table)
.withExpression(expression);
List<PartitionValueList> partitionsToDelete = new ArrayList<>();
do {
GetPartitionsResult getPartitionsResult = this.glue.getPartitions(getPartitionsRequest);
List<PartitionValueList> partitionsValues = getPartitionsResult.getPartitions()
.parallelStream()
.map(p -> new PartitionValueList().withValues(p.getValues()))
.collect(Collectors.toList());
partitionsToDelete.addAll(partitionsValues);
getPartitionsRequest.setNextToken(getPartitionsResult.getNextToken());
} while (getPartitionsRequest.getNextToken() != null);
Lists.partition(partitionsToDelete, 25)
.parallelStream()
.forEach(partitionValueList -> {
glue.batchDeletePartition(
new BatchDeletePartitionRequest()
.withCatalogId(catalog)
.withDatabaseName(db)
.withTableName(table)
.withPartitionsToDelete(partitionValueList));
});
}
Upvotes: 0
Reputation: 185
this is the script the does what Theo recommended.
import json
import logging
import awswrangler as wr
import boto3
from botocore.exceptions import ClientError
logging.basicConfig(level=logging.INFO, format=logging.BASIC_FORMAT)
logger = logging.getLogger()
def delete_partitions(database_name: str, table_name: str):
client = boto3.client('glue')
paginator = client.get_paginator('get_partitions')
page_count = 0
partition_count = 0
for page in paginator.paginate(DatabaseName=database_name, TableName=table_name, MaxResults=20):
page_count = page_count + 1
partitions = page['Partitions']
partitions_to_delete = []
for partition in partitions:
partition_count = partition_count + 1
partitions_to_delete.append({'Values': partition['Values']})
logger.info(f"Found partition {partition['Values']}")
if partitions_to_delete:
response = client.batch_delete_partition(DatabaseName=database_name, TableName=table_name,
PartitionsToDelete=partitions_to_delete)
logger.info(f'Deleted partitions with response: {response}')
else:
logger.info('Done with all partitions')
def repair_table(database_name: str, table_name: str):
client = boto3.client('athena')
try:
response = client.start_query_execution(QueryString='MSCK REPAIR TABLE ' + table_name + ';',
QueryExecutionContext={'Database': database_name}, )
except ClientError as err:
logger.info(err.response['Error']['Message'])
else:
res = wr.athena.wait_query(query_execution_id=response['QueryExecutionId'])
logger.info(f"Query succeeded: {json.dumps(res, indent=2)}")
if __name__ == '__main__':
table = 'table_name'
database = 'database_name'
delete_partitions(database_name=database, table_name=table)
repair_table(database_name=database, table_name=table)
Upvotes: 2
Reputation: 132862
While the Athena SQL may not support it at this time, the Glue API call GetPartitions
(that Athena uses under the hood for queries) supports complex filter expressions similar to what you can write in a SQL WHERE
expression.
Instead of deleting partitions through Athena you can do GetPartitions
followed by BatchDeletePartition
using the Glue API.
Upvotes: 3
Reputation: 20710
According to https://docs.aws.amazon.com/athena/latest/ug/alter-table-drop-partition.html, ALTER TABLE tblname DROP PARTITION
takes a partition spec, so no ranges are allowed.
In Presto you would do DELETE FROM tblname WHERE ...
, but DELETE
is not supported by Athena either.
For these reasons, you need to do leverage some external solution.
For example:
Upvotes: 7