Yossi Vainshtein
Yossi Vainshtein

Reputation: 3985

AWS glue delete all partitions

I defined several tables in AWS glue.

Over the past few weeks, I've had different issues with the table definition which I had to fix manually - I want to change column names, or types, or change the serialization lib. However, If i already have partitions created, the repairing of table doesn't change them, and so I have to delete all partitions manually and then repairing.

Is there a simple way to do this? Delete all partitions from an AWS Glue table? I'm using aws batch-delete-partition CLI command, but it's syntax is tricky, and there are some limitations on the amount of partitions you can delete in one go, the whole thing is cumbersome...

Upvotes: 5

Views: 17391

Answers (4)

NickC
NickC

Reputation: 377

Here is a javascript version designed to run in a Lambda. Tested and functional in Lambda Node v16.

const AWS = require('aws-sdk');

// Set the region and initialize the Glue client
AWS.config.update({ region: 'us-east-1' });
const glue = new AWS.Glue();

exports.handler = async (event, context) => {

    // Specify the database and table names
    const databaseName = event.database;
    const tableName = event.table;

    console.log("database:", databaseName);
    console.log("table:", tableName);

    
    async function getAllPartitions(databaseName, tableName) {
        let partitions = [];
        let nextToken;

        do {
            const params = {
                DatabaseName: databaseName,
                TableName: tableName,
                NextToken: nextToken,
            };

            const response = await glue.getPartitions(params).promise();
            partitions = partitions.concat(response.Partitions || []);
            nextToken = response.NextToken;
        } while (nextToken);

        return partitions;
    }

    try {
        // Get a list of partitions for the table
        const partitions = await getAllPartitions(databaseName, tableName);
        console.log(`retrieved ${partitions.length} partitions`);

        // Delete all partitions in batches of 25
        const partitionBatches = [];
        for (let i = 0; i < partitions.length; i += 25) {
            partitionBatches.push(partitions.slice(i, i + 25));
        }

        await Promise.all(partitionBatches.map((partitionBatch) => {

            let partitionValuesArray = [];
            partitionBatch.forEach((partition => {
                partitionValuesArray.push({ Values: partition.Values });
            }));

            glue.batchDeletePartition({
                DatabaseName: databaseName,
                TableName: tableName,
                PartitionsToDelete: partitionValuesArray,
            }).promise();
        }));

        console.log(`Deleted ${partitions.length} partitions`);
    }
    catch (err) {
        console.error(err);
    }

};

Just swap out the event variables with the names manually, or set up test events to change targets. Or can be triggered programmatically however you wish.

Recommend increasing default timeout for large datasets. IAM role for Lambda requires Glue permissions.

Upvotes: 0

Golammott
Golammott

Reputation: 536

Using python3 with boto3 looks a little bit nicer. Albeit not by much :)

Unfortunately AWS doesn't provide a way to delete all partitions without batching 25 requests at a time. Note that this will only work for deleting the first page of partitions retrieved.

import boto3

glue_client = boto3.client("glue", "us-west-2")

def get_and_delete_partitions(database, table, batch=25):
    partitions = glue_client.get_partitions(
        DatabaseName=database,
        TableName=table)["Partitions"]

    for i in range(0, len(partitions), batch):
        to_delete = [{k:v[k]} for k,v in zip(["Values"]*batch, partitions[i:i+batch])]
        glue_client.batch_delete_partition(
            DatabaseName=database,
            TableName=table,
            PartitionsToDelete=to_delete)

EDIT: To delete all partitions (beyond just the first page) using paginators makes it look cleaner.

import boto3

glue_client = boto3.client("glue", "us-west-2")

def delete_partitions(database, table, partitions, batch=25):
    for i in range(0, len(partitions), batch):
      to_delete = [{k:v[k]} for k,v in zip(["Values"]*batch, partitions[i:i+batch])]
      glue_client.batch_delete_partition(
        DatabaseName=database,
        TableName=table,
        PartitionsToDelete=to_delete)

def get_and_delete_partitions(database, table):
    paginator = glue_client.get_paginator('get_partitions')
    itr = paginator.paginate(DatabaseName=database, TableName=table)
    
    for page in itr:
      delete_partitions(database, table, page["Partitions"])

Upvotes: 10

Yossi Vainshtein
Yossi Vainshtein

Reputation: 3985

For now, I found this command line solution, runinng aws glue batch-delete-partition iteratively for batches of 25 partitions using xargs

(here I am assuming there are max 1000 partitions):

aws glue get-partitions --database-name=<my-database> --table-name=<my-table> | jq -cr '[ { Values: .Partitions[].Values } ]' > partitions.json

seq 0 25 1000 | xargs -I _ bash -c "cat partitions.json | jq -c '.[_:_+25]'" | while read X; do aws glue batch-delete-partition --database-name=<my-database> --table-name=<my-table > --partitions-to-delete=$X; done

Hope it helps someone, but I'd prefer a more elegant solution

Upvotes: 16

Luigi Plinge
Luigi Plinge

Reputation: 51099

Here is a PowerShell version FWIW:

$database = 'your db name'
$table = 'your table name'
# Set the variables above

$batch_size = 25
Set-DefaultAWSRegion -Region eu-west-2
$partition_list = Get-GLUEPartitionList -DatabaseName $database -TableName $table

$selected_partitions = $partition_list
# Uncomment and edit predicate to select only certain partitions
# $selected_partitions = $partition_list | Where-Object {$_.Values[0] -gt '2020-07-20'} 

$selected_values = $selected_partitions | Select-Object -Property Values

for ($i = 0; $i -lt $selected_values.Count; $i += $batch_size) {
    $chunk = $selected_values[$i..($i + $batch_size - 1)]
    Remove-GLUEPartitionBatch -DatabaseName $database -TableName $table -PartitionsToDelete $chunk -Force
}

# Now run `MSCK REPAIR TABLE db_name.table_name` to add the partitions again

Upvotes: 2

Related Questions