tarun kumar Sharma

Reputation: 917

Get the most recently modified file per folder in AWS S3 using Python

I have multiple keys under my AWS S3 bucket. The structure is:

bucket/tableName1/Archive/archive1.json -to- bucket/tableName1/Archive/archiveN.json
bucket/tableName2/Archive/archive2.json -to- bucket/tableName2/Archive/archiveN.json
bucket/tableName1/Audit/audit1.json -to- bucket/tableName1/Audit/auditN.json
bucket/tableName2/Audit/audit2.json -to- bucket/tableName2/Audit/auditN.json

For each table, I want to consider only the keys under the Audit folder (if one is present) and pick just the latest file, i.e. the one with the most recent last-modified time in that Audit folder.

The result that I am trying to get is a list of dictionaries:

[{'tableName1' : 'auditN.json'}, {'tableName2' : 'auditN.json'}]

Assuming auditN.json is the newest file.

I have tried different methods but am not getting the desired result. I am working in a Databricks notebook. Is there a way to achieve this?

Upvotes: 2

Views: 1855

Answers (1)

Amit Baranes

Reputation: 8152

Well, I've read and searched through a lot of threads about what you're asking, but with no luck, so I wrote my own Lambda function.

The following code snippet iterates over all top-level folders, then over their subfolders; if a subfolder is named Audit, it sorts that subfolder's objects by last-modified time and prints the newest one.

Be aware that this code fits your structure only, since the list_folders function returns only the first level of subfolders.

If your structure changes to something like this:

bucket/tableName1/Audit/Audit1/audit.json

the Lambda won't work.
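The one-level behaviour comes from Delimiter='/': list_objects_v2 groups everything after the first '/' beyond the prefix into CommonPrefixes, so nested folders are not expanded. A minimal sketch with a hand-built response dict (shaped like real list_objects_v2 output; no AWS call is made):

```python
def prefixes_from_response(response):
    # Yield each grouped "folder" prefix from a list_objects_v2-style
    # response, the same way list_folders below does.
    for content in response.get('CommonPrefixes', []):
        yield content.get('Prefix')

# Hand-built response mimicking list_objects_v2(Prefix='tableName1/', Delimiter='/'):
# only the immediate subfolders appear, never tableName1/Audit/Audit1/.
fake_response = {
    'CommonPrefixes': [
        {'Prefix': 'tableName1/Archive/'},
        {'Prefix': 'tableName1/Audit/'},
    ]
}

print(list(prefixes_from_response(fake_response)))
```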

Code snippet:

import boto3

# bucket name (replace with your bucket)
bucket_name = 'Bucket Name'

# bucket resource
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket_name)

# bucket client
s3_client = boto3.client("s3")

# sort key: last-modified timestamp (datetimes compare directly,
# which is more portable than strftime('%s'))
get_last_modified = lambda obj: obj.last_modified

# get subfolders - 1 LEVEL ONLY!
def list_folders(s3_client, bucket_name, prefix):
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix, Delimiter='/')
    for content in response.get('CommonPrefixes', []):
        yield content.get('Prefix')

def lambda_handler(event, context):
    # get all top-level folders
    folder_list = list_folders(s3_client, bucket_name, '')
    for folder in folder_list:
        # get all subfolders of this folder
        subfolders = list_folders(s3_client, bucket_name, folder)
        for subfolder in subfolders:
            # check whether the subfolder is named Audit
            if subfolder.split('/')[1] == 'Audit':
                # get all objects under the subfolder
                objs = list(bucket.objects.filter(Prefix=subfolder))
                # pick the object with the most recent last-modified time
                last_modified_file = max(objs, key=get_last_modified)
                # print results
                print('Last modified file Name: %s ---- Date: %s' % (last_modified_file.key, last_modified_file.last_modified))

Tested against the following files: [screenshots of the bucket listing omitted]

Table2's subfolder is named Archive: [screenshot omitted]

Output: [screenshot omitted]
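The snippet above prints the newest file; to produce the list-of-dicts shape the question asks for ([{'tableName1': 'auditN.json'}, ...]), the selection logic can be factored into a pure function. A sketch, assuming keys sit directly under <table>/Audit/; the boto3 call that feeds it is left as a comment since it needs a live bucket:

```python
def newest_audit_per_table(objects):
    """Given (key, last_modified) pairs, return [{table: newest Audit file}, ...]."""
    newest = {}
    for key, modified in objects:
        parts = key.split('/')
        # keep only keys shaped like <table>/Audit/<file>
        if len(parts) == 3 and parts[1] == 'Audit' and parts[2]:
            table, filename = parts[0], parts[2]
            if table not in newest or modified > newest[table][0]:
                newest[table] = (modified, filename)
    return [{table: filename} for table, (_, filename) in sorted(newest.items())]

# Feeding it from S3 (requires a live bucket):
# result = newest_audit_per_table(
#     (obj.key, obj.last_modified) for obj in bucket.objects.all()
# )
```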

Hope you will find it helpful.

Upvotes: 1
