Guddi

Reputation: 79

How to copy/move downloaded files from an S3 bucket to a different folder under the same bucket, without downloading the latest file

I am using Python 2.7.x and Boto API 2.x to connect to an AWS S3 bucket. I have a unique situation where I want to download files from a specific directory/folder of an S3 bucket, say myBucket/foo/. The catch is that I want to leave the latest file behind in the S3 folder and not download it. Once I have downloaded these files to my local box, I want to move them to a different folder under the same bucket, say myBucket/foo/bar/. Has anyone worked on a similar situation before?

Here is some explanation:

  1. Move downloaded files from an S3 bucket to a different folder path under the same bucket.

My S3 bucket: event-logs

The folder path on the S3 bucket from which files will be downloaded:

event-logs/apps/raw/source_data/

The folder path on the S3 bucket to which the downloaded files will be moved (archive):

event-logs/apps/raw/archive_data/ 

Note: the "event-logs/apps/raw/" prefix is common to both paths under the same bucket.
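For reference, S3 has no server-side "move": moving a key within the same bucket is a copy followed by a delete. A minimal boto 2 sketch using the bucket and paths above (the connection variables are illustrative):

import boto

conn = boto.connect_s3(aws_access_key, aws_secret_access_key)
bucket = conn.get_bucket('event-logs')

src_key = 'apps/raw/source_data/data1.gz'
dst_key = 'apps/raw/archive_data/data1.gz'

# Copy within the same bucket, then delete the original.
bucket.copy_key(dst_key, bucket.name, src_key)
bucket.delete_key(src_key)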

So if I have 5 files under the source_data folder on S3:

s3://event-logs/apps/raw/source_data/data1.gz
s3://event-logs/apps/raw/source_data/data2.gz
s3://event-logs/apps/raw/source_data/data3.gz
s3://event-logs/apps/raw/source_data/data4.gz
s3://event-logs/apps/raw/source_data/data5.gz

I need to download the first 4 files (the oldest) to my local machine and leave the latest file, i.e. data5.gz, behind. After the download is complete, I want to move those files from the S3 ../source_data folder to the ../archive_data folder under the same S3 bucket, and delete them from the original source_data folder. Here is my code to list the files on S3, download them, and then delete them.

aws_log_ship = AwsLogShip(aws_access_key, aws_secret_access_key, use_ssl=True)
file_names = aws_log_ship.getFileNamesInBucket(aws_bucket)

def getFileNamesInBucket(self, aws_bucket):
    """Return the names of all keys under the source_data prefix."""
    if not self._bucketExists(aws_bucket):
        self._printBucketNotFoundMessage(aws_bucket)
        return list()
    else:
        bucket = self._aws_connection.get_bucket(aws_bucket)
        return map(lambda aws_file_key: aws_file_key.name,
                   bucket.list("apps/raw/source_data/"))

aws_log_ship.downloadFileFromBucket(aws_bucket, filename, local_download_directory)

def downloadFileFromBucket(self, aws_bucket, filename, local_download_directory):
    """Download the single key whose name matches filename."""
    if not self._bucketExists(aws_bucket):
        self._printBucketNotFoundMessage(aws_bucket)
    else:
        bucket = self._aws_connection.get_bucket(aws_bucket)
        for s3_file in bucket.list("apps/raw/source_data/"):
            if filename == s3_file.name:
                self._downloadFile(s3_file, local_download_directory)
                break

aws_log_ship.deleteFilesInBucketWith(aws_bucket, filename)

def deleteFilesInBucketWith(self, aws_bucket, filename):
    """Delete the key whose name matches filename from the source_data prefix."""
    if not self._bucketExists(aws_bucket):
        self._printBucketNotFoundMessage(aws_bucket)
    else:
        bucket = self._aws_connection.get_bucket(aws_bucket)
        for s3_file in filter(lambda fkey: fkey.name == filename,
                              bucket.list("apps/raw/source_data/")):
            self._deleteFile(bucket, s3_file)

What I really want to achieve here is:

  1. Select the list of oldest files to download, which means always leaving the most recently modified file behind and performing no action on it (the idea being that the file may not be ready to download, or may still be being written).
  2. The same list of files that was downloaded needs to be moved to a new location under the same bucket and deleted from the original source_data folder; a sketch tying both steps together follows this list.
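A minimal boto 2 sketch of both steps, assuming the bucket and prefixes above (the helper name ship_all_but_latest is illustrative, not part of the AwsLogShip class):

import os
import boto

def ship_all_but_latest(conn, bucket_name, src_prefix, dst_prefix, local_dir):
    # Sort keys by last-modified time; last_modified is an ISO 8601
    # string, so lexicographic order equals chronological order.
    bucket = conn.get_bucket(bucket_name)
    keys = sorted(bucket.list(src_prefix), key=lambda k: k.last_modified)
    for key in keys[:-1]:  # skip the newest file, leaving it behind
        filename = os.path.basename(key.name)
        key.get_contents_to_filename(os.path.join(local_dir, filename))
        # "Move": copy under the archive prefix, then delete the original.
        bucket.copy_key(dst_prefix + filename, bucket_name, key.name)
        bucket.delete_key(key.name)

conn = boto.connect_s3(aws_access_key, aws_secret_access_key)
ship_all_but_latest(conn, 'event-logs', 'apps/raw/source_data/',
                    'apps/raw/archive_data/', local_download_directory)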

Upvotes: 0

Views: 2534

Answers (1)

Guddi

Reputation: 79

This is how I solved this problem!

# List all keys under the prefix, sort by last-modified time, and
# drop the newest one so it is left behind.
bucket_list = bucket.list(prefix='Download/test_queue1/', delimiter='/')
list1 = sorted(bucket_list, key=lambda item1: item1.last_modified)
self.list2 = list1[:-1]
for item in self.list2:
    self._bucketList(bucket, item)

def _bucketList(self, bucket, item):
    print item.name, item.last_modified
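For completeness, _bucketList could perform the actual download and archive move instead of just printing; a sketch in the same boto 2 style (the local directory and archive prefix here are assumptions, not part of the original answer):

import os

def _bucketList(self, bucket, item):
    # Download the key to a local directory (path is illustrative).
    local_path = os.path.join('/tmp/downloads', os.path.basename(item.name))
    item.get_contents_to_filename(local_path)
    # Then "move" it: copy under an archive prefix and delete the original.
    archived = 'Download/archive/' + os.path.basename(item.name)
    bucket.copy_key(archived, bucket.name, item.name)
    bucket.delete_key(item.name)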

Upvotes: 0
