Karthik

Reputation: 471

In an S3 bucket, move a large number of files from one folder into multiple folders

Currently, I have 30 million files in one folder of an S3 bucket. I want to move them into 4 folders in the same bucket, 7.5 million files per folder.

I tried the AWS CLI, but I have no idea how to limit the number of files it moves:

aws s3 mv s3://BUCKETNAME/myfolder/ s3://BUCKETNAME/folder1/ --recursive

How can I loop and move only 7.5 million files into each folder?

import boto3

aws_access_key_id = ""
aws_secret_access_key = ""
bucket_from = ""
bucket_to = ""

s3 = boto3.resource(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key
)
src = s3.Bucket(bucket_from)

def move_files():
    # Copy each object, then delete the source so this is a move, not just a copy
    for archive in src.objects.all():
        s3.meta.client.copy_object(
            ACL='public-read',
            Bucket=bucket_to,
            CopySource={'Bucket': bucket_from, 'Key': archive.key},
            Key=archive.key
        )
        # Remove the original only after the copy succeeds
        s3.meta.client.delete_object(Bucket=bucket_from, Key=archive.key)

move_files()

Upvotes: 2

Views: 2596

Answers (1)

John Rotenstein

Reputation: 269091

I would recommend:

1. Obtain object listing using Amazon S3 Inventory

Listing millions of objects can take a long time. Instead, use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.

This will provide you with a definitive list of current objects.
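An inventory configuration only needs to be created once, for example with `aws s3api put-bucket-inventory-configuration --inventory-configuration file://inventory.json`. A sketch of that JSON, where the configuration Id and the destination bucket are assumptions:

```json
{
  "Id": "daily-inventory",
  "IsEnabled": true,
  "IncludedObjectVersions": "Current",
  "Schedule": { "Frequency": "Daily" },
  "Destination": {
    "S3BucketDestination": {
      "Bucket": "arn:aws:s3:::inventory-reports-bucket",
      "Format": "CSV"
    }
  }
}
```

The first report can take up to 48 hours to be delivered.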

2. Split into 4 lists

Use a text editor to split the file list into 4 separate files -- one for each of your destination folders.
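With 30 million rows, a short script may be more practical than a text editor. A minimal sketch that divides inventory rows into 4 near-equal chunks and writes one CSV manifest per destination folder (the file names and rows are illustrative):

```python
import csv

def split_manifest(rows, parts=4):
    # Distribute rows across `parts` chunks of near-equal size
    size = (len(rows) + parts - 1) // parts
    return [rows[i * size:(i + 1) * size] for i in range(parts)]

# Illustrative inventory rows: (bucket, key) pairs
rows = [("BUCKETNAME", f"myfolder/file{i}.dat") for i in range(10)]
chunks = split_manifest(rows, parts=4)

# One manifest file per destination folder (names are illustrative)
for n, chunk in enumerate(chunks, start=1):
    with open(f"manifest-folder{n}.csv", "w", newline="") as f:
        csv.writer(f).writerows(chunk)
```

Each output file can then serve as the manifest for one Batch Operations job.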

3. Use Amazon S3 Batch Operations to copy objects

Copying millions of objects would take a long time unless you multi-thread the process.

The easier and faster method is to use Amazon S3 Batch Operations. It can take the S3 Inventory file as input and then perform all the copy operations for you in parallel.
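For example, a copy operation passed as `--operation` to `aws s3control create-job` can prefix every copied key with the destination folder (the bucket ARN and prefix here are assumptions):

```json
{
  "S3PutObjectCopy": {
    "TargetResource": "arn:aws:s3:::BUCKETNAME",
    "TargetKeyPrefix": "folder1/"
  }
}
```

Creating 4 jobs, each with one of the manifest files from step 2 and a different `TargetKeyPrefix`, distributes the 30 million objects across the 4 folders.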

4. Clean-up

I recommend that you do not delete the source files until you are sure that all the copying completed correctly. You can again use S3 Inventory to obtain a list for comparison purposes.

Once you are ready to delete the source files, you can use an S3 Lifecycle rule to expire the original objects. Be very careful that you do not delete the copied objects at the same time! For this reason alone, it might be better to copy the objects to a different bucket than the source bucket.
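A sketch of a lifecycle configuration that expires only objects under the original prefix, passed to `aws s3api put-bucket-lifecycle-configuration --lifecycle-configuration file://lifecycle.json` (the rule ID and prefix are illustrative):

```json
{
  "Rules": [
    {
      "ID": "expire-moved-originals",
      "Filter": { "Prefix": "myfolder/" },
      "Status": "Enabled",
      "Expiration": { "Days": 1 }
    }
  ]
}
```

The `Prefix` filter is what keeps the rule away from the copied objects; double-check it before enabling the rule.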

Upvotes: 1
