sakthi srinivas
sakthi srinivas

Reputation: 192

AWS Athena: Delete partitions between date range

I have an athena table with partition based on date like this:

20190218

I want to delete all the partitions that are created last year.

I tried the below query, but it didnt work.

ALTER TABLE tblname DROP PARTITION (partition1 < '20181231');

ALTER TABLE tblname DROP PARTITION (partition1 > '20181010'), Partition (partition1 < '20181231');

Upvotes: 6

Views: 16158

Answers (4)

Alex
Alex

Reputation: 8937

Posting the Glue API workaround for Java to save some time for these who need it:

public void deleteMetadataTablePartition(String catalog,
                                         String db,
                                         String table,
                                         String expression) {

    GetPartitionsRequest getPartitionsRequest = new GetPartitionsRequest()
            .withCatalogId(catalog)
            .withDatabaseName(db)
            .withTableName(table)
            .withExpression(expression);

    List<PartitionValueList> partitionsToDelete = new ArrayList<>();

    do {
        GetPartitionsResult getPartitionsResult = this.glue.getPartitions(getPartitionsRequest);
        List<PartitionValueList> partitionsValues = getPartitionsResult.getPartitions()
                .parallelStream()
                .map(p -> new PartitionValueList().withValues(p.getValues()))
                .collect(Collectors.toList());

        partitionsToDelete.addAll(partitionsValues);

        getPartitionsRequest.setNextToken(getPartitionsResult.getNextToken());
    } while (getPartitionsRequest.getNextToken() != null);

    Lists.partition(partitionsToDelete, 25)
            .parallelStream()
            .forEach(partitionValueList -> {
                glue.batchDeletePartition(
                        new BatchDeletePartitionRequest()
                                .withCatalogId(catalog)
                                .withDatabaseName(db)
                                .withTableName(table)
                                .withPartitionsToDelete(partitionValueList));
            });
}

Upvotes: 0

Khoa
Khoa

Reputation: 185

this is the script the does what Theo recommended.

import json
import logging

import awswrangler as wr
import boto3
from botocore.exceptions import ClientError

logging.basicConfig(level=logging.INFO, format=logging.BASIC_FORMAT)
logger = logging.getLogger()


def delete_partitions(database_name: str, table_name: str):
  client = boto3.client('glue')
  paginator = client.get_paginator('get_partitions')
  page_count = 0
  partition_count = 0
  for page in paginator.paginate(DatabaseName=database_name, TableName=table_name, MaxResults=20):
    page_count = page_count + 1
    partitions = page['Partitions']
    partitions_to_delete = []
    for partition in partitions:
      partition_count = partition_count + 1
      partitions_to_delete.append({'Values': partition['Values']})
      logger.info(f"Found partition {partition['Values']}")
    if partitions_to_delete:
      response = client.batch_delete_partition(DatabaseName=database_name, TableName=table_name,
        PartitionsToDelete=partitions_to_delete)
      logger.info(f'Deleted partitions with response: {response}')
    else:
      logger.info('Done with all partitions')


def repair_table(database_name: str, table_name: str):
  client = boto3.client('athena')
  try:
    response = client.start_query_execution(QueryString='MSCK REPAIR TABLE ' + table_name + ';',
      QueryExecutionContext={'Database': database_name}, )
  except ClientError as err:
    logger.info(err.response['Error']['Message'])
  else:
    res = wr.athena.wait_query(query_execution_id=response['QueryExecutionId'])
    logger.info(f"Query succeeded: {json.dumps(res, indent=2)}")


if __name__ == '__main__':
  table = 'table_name'
  database = 'database_name'
  delete_partitions(database_name=database, table_name=table)
  repair_table(database_name=database, table_name=table)

Upvotes: 2

Theo
Theo

Reputation: 132862

While the Athena SQL may not support it at this time, the Glue API call GetPartitions (that Athena uses under the hood for queries) supports complex filter expressions similar to what you can write in a SQL WHERE expression.

Instead of deleting partitions through Athena you can do GetPartitions followed by BatchDeletePartition using the Glue API.

Upvotes: 3

Piotr Findeisen
Piotr Findeisen

Reputation: 20710

According to https://docs.aws.amazon.com/athena/latest/ug/alter-table-drop-partition.html, ALTER TABLE tblname DROP PARTITION takes a partition spec, so no ranges are allowed.

In Presto you would do DELETE FROM tblname WHERE ..., but DELETE is not supported by Athena either.

For these reasons, you need to do leverage some external solution.

For example:

  1. list the files as in https://stackoverflow.com/a/48824373/65458
  2. delete the files and containing directories
  3. update partitions information (https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html should be helpful)

Upvotes: 7

Related Questions