Reputation: 700
I am not able to find any solution for recursively copying contents from one prefix to another within an S3 bucket using boto in Python.
Suppose a bucket B1 has a key structure like B1/x/*. I want to copy all the objects recursively from keys under B1/x/* to B1/y/*.
Upvotes: 11
Views: 32631
Reputation: 1204
Instead of using boto3, I opt for aws-cli and sh. See the aws s3 cp docs for the full list of arguments, which you can include as kwargs in the following (reworked from my own code), which can be used to copy to / from / between S3 buckets and / or local targets:
import sh  # also assumes aws-cli has been installed

def s3_cp(source, target, **kwargs):
    """
    Copy data from source to target. Include flags as kwargs
    such as recursive=True and include=xyz
    """
    args = []
    for flag_name, flag_value in kwargs.items():
        # negating a boolean flag implies omitting the flag altogether
        if flag_value is False:
            continue
        # always include the flag name
        args.append(f"--{flag_name}")
        # include the flag value if it's not boolean
        if flag_value is not True:
            args.append(flag_value)
    args += [source, target]
    # hand the assembled argument list to aws-cli via sh
    sh.aws("s3", "cp", *args)
bucket to bucket (as per the OP's question):
s3_cp("s3://B1/x/", "s3://B1/y/", quiet=True, recursive=True)
or bucket to local:
s3_cp("s3://B1/x/", "my-local-dir/", quiet=True, recursive=True)
Personally I found that this method improved transfer time (for a few GB spread over ~20k small files) from a couple of hours with boto3 to a few minutes. Perhaps under the hood it's doing some threading or simply opening a few connections - but that's just speculation.
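If you want to stay in pure boto3 and chase a similar speed-up, one option is to run copy_object from a thread pool yourself. This is only a minimal sketch: the bucket name and prefixes are illustrative, and it skips pagination for brevity (boto3 clients are thread-safe):

from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")

def copy_one(key):
    # copy a single object from the B1/x/ prefix to B1/y/ within "mybucket"
    s3.copy_object(
        CopySource={"Bucket": "mybucket", "Key": key},
        Bucket="mybucket",
        Key=key.replace("B1/x/", "B1/y/", 1),
    )

# note: list_objects_v2 returns at most 1000 keys; paginate for larger prefixes
keys = [o["Key"] for o in s3.list_objects_v2(Bucket="mybucket", Prefix="B1/x/")["Contents"]]
with ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(copy_one, keys))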
Warning: the sh-based approach won't work on Windows.
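On Windows, a similar wrapper can shell out with subprocess instead of sh; a minimal sketch, assuming aws-cli is on the PATH (the function name is just illustrative):

import subprocess

def s3_cp_subprocess(source, target, *flags):
    # e.g. s3_cp_subprocess("s3://B1/x/", "s3://B1/y/", "--recursive", "--quiet")
    subprocess.run(["aws", "s3", "cp", source, target, *flags], check=True)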
Related: https://stackoverflow.com/a/46680575/1571593
Upvotes: 0
Reputation: 3333
Another boto3 alternative, using the higher-level resource API rather than client:
import os

import boto3

def copy_prefix_within_s3_bucket(
    endpoint_url: str,
    bucket_name: str,
    old_prefix: str,
    new_prefix: str,
) -> None:
    bucket = boto3.resource(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    ).Bucket(bucket_name)

    for obj in bucket.objects.filter(Prefix=old_prefix):
        old_key = obj.key
        # only swap the leading prefix, in case old_prefix also appears later in the key
        new_key = old_key.replace(old_prefix, new_prefix, 1)
        copy_source = {"Bucket": bucket_name, "Key": old_key}
        bucket.copy(copy_source, new_key)

if __name__ == "__main__":
    copy_prefix_within_s3_bucket(
        endpoint_url="my_endpoint_url",
        bucket_name="my_bucket_name",
        old_prefix="my_old_prefix",
        new_prefix="my_new_prefix",
    )
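The same resource API also works when the source and destination are two different buckets; a minimal sketch (the function and variable names here are illustrative, not part of the original answer):

import boto3

def copy_prefix_between_buckets(src_bucket: str, dst_bucket: str, prefix: str) -> None:
    s3 = boto3.resource("s3")
    dst = s3.Bucket(dst_bucket)
    # Bucket.copy() accepts a CopySource dict pointing at the other bucket
    for obj in s3.Bucket(src_bucket).objects.filter(Prefix=prefix):
        dst.copy({"Bucket": src_bucket, "Key": obj.key}, obj.key)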
Upvotes: 0
Reputation: 214
Just trying to build on the previous answer:
import os

import boto3

s3 = boto3.client('s3')

def copyFolderFromS3(pathFrom, bucketTo, locationTo):
    # pathFrom is expected to look like "s3://source-bucket/some/prefix/"
    if not pathFrom.startswith('s3://'):
        return
    getBucket = pathFrom.split('/')[2]
    location = '/'.join(pathFrom.split('/')[3:])
    copy_source = {'Bucket': getBucket, 'Key': location}
    recursiveCopyFolderToS3(copy_source, bucketTo, locationTo)

def recursiveCopyFolderToS3(src, uplB, uplK):
    more_objects = True
    found_token = None
    while more_objects:
        if found_token is None:
            # first page of the listing
            response = s3.list_objects_v2(
                Bucket=src['Bucket'],
                Prefix=src['Key'],
                Delimiter="/")
        else:
            # subsequent pages, resumed via the continuation token
            response = s3.list_objects_v2(
                Bucket=src['Bucket'],
                ContinuationToken=found_token,
                Prefix=src['Key'],
                Delimiter="/")
        for source in response.get("Contents", []):
            raw_name = source["Key"].split("/")[-1]
            new_name = os.path.join(uplK, raw_name)
            if raw_name.endswith('_$folder$'):
                # EMR-style folder marker: recurse into the sub-prefix
                sub_src = {'Bucket': src['Bucket'],
                           'Key': source["Key"].replace('_$folder$', '/')}
                new_name = new_name.replace('_$folder$', '')
                recursiveCopyFolderToS3(sub_src, uplB, new_name)
            else:
                # copy a single object without mutating the listing prefix in src
                copy_source = {'Bucket': src['Bucket'], 'Key': source["Key"]}
                s3.copy_object(CopySource=copy_source, Bucket=uplB, Key=new_name)
        if "NextContinuationToken" in response:
            found_token = response["NextContinuationToken"]
            more_objects = True
        else:
            more_objects = False
Or you can also use the plain awscli, which is installed by default on EC2/EMR machines:
import subprocess

# path is the source (e.g. "s3://B1/x/") and uploadUrl the destination
cmd = 'aws s3 cp ' + path + ' ' + uploadUrl + ' --recursive'
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
p.communicate()
Upvotes: 6
Reputation: 13176
There is no "directory" in S3. The "/" separator is just part of the object name, which is why boto doesn't have such a feature. Either write a script to deal with it or use third-party tools.
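To illustrate the point (a minimal sketch; the bucket and key names are only examples): a key containing "/" is created in a single call, with no mkdir step, and a "folder" is just a shared prefix you filter on:

import boto3

s3 = boto3.client("s3")
# the whole "path" comes into existence with the object itself
s3.put_object(Bucket="mybucket", Key="B1/x/report.csv", Body=b"data")
# "B1/x/" is only a prefix filter over flat key names, not a real directory
for obj in s3.list_objects_v2(Bucket="mybucket", Prefix="B1/x/")["Contents"]:
    print(obj["Key"])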
AWS customerapps lists s3browser, which provides such arbitrary directory-copying functionality. The free version typically only spawns two threads to move files; the paid version lets you specify more threads and runs faster.
Or you can just write a script that uses s3.client.copy_object to copy each object to its new name and then delete the originals afterwards, e.g.:
import boto3

s3 = boto3.client("s3")

# list_objects_v2() gives more info than list_objects()
more_objects = True
found_token = None
while more_objects:
    if found_token is None:
        # first page of the listing
        response = s3.list_objects_v2(
            Bucket="mybucket",
            Prefix="B1/x/",
            Delimiter="/")
    else:
        # subsequent pages, resumed via the continuation token
        response = s3.list_objects_v2(
            Bucket="mybucket",
            ContinuationToken=found_token,
            Prefix="B1/x/",
            Delimiter="/")
    # use copy_object or copy_from
    for source in response.get("Contents", []):
        raw_name = source["Key"].split("/")[-1]
        new_name = "new_structure/{}".format(raw_name)
        s3.copy_object(
            CopySource={"Bucket": "mybucket", "Key": source["Key"]},
            Bucket="mybucket",
            Key=new_name,
        )
    # Now check whether there are more objects to list
    if "NextContinuationToken" in response:
        found_token = response["NextContinuationToken"]
        more_objects = True
    else:
        more_objects = False
**IMPORTANT NOTES**: list_objects returns at most 1000 keys per call, and MaxKeys will not raise that limit. So you must use list_objects_v2 and check whether NextContinuationToken is returned to see whether there are more objects, repeating until the listing is exhausted.
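The loop above only copies; to finish the "move" you still need to delete the originals. A minimal sketch of that cleanup, reusing the same bucket and prefix and letting boto3's paginator handle the NextContinuationToken loop described above (delete_objects accepts at most 1000 keys per call):

# collect the old keys page by page, then delete them in batches of up to 1000
paginator = s3.get_paginator("list_objects_v2")
to_delete = []
for page in paginator.paginate(Bucket="mybucket", Prefix="B1/x/"):
    to_delete += [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
for i in range(0, len(to_delete), 1000):
    s3.delete_objects(Bucket="mybucket", Delete={"Objects": to_delete[i:i + 1000]})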
Upvotes: 7