Reputation: 917
I have multiple keys in my AWS S3 bucket. The structure is:
bucket/tableName1/Archive/archive1.json -to- bucket/tableName1/Archive/archiveN.json
bucket/tableName2/Archive/archive2.json -to- bucket/tableName2/Archive/archiveN.json
bucket/tableName1/Audit/audit1.json -to- bucket/tableName1/Audit/auditN.json
bucket/tableName2/Audit/audit2.json -to- bucket/tableName2/Audit/auditN.json
I want to get the keys from the Audit folder only if it is present in a key, and from that Audit folder get only the latest file, i.e. the one whose last-modified time is most recent.
The result I am trying to get is a list of dictionaries:
[{'tableName1' : 'auditN.json'}, {'tableName2' : 'auditN.json'}]
Assuming auditN.json is the newest file.
I tried different methods but I am not getting the desired result. I am trying the solution in a Databricks notebook. Is there a way I can achieve this?
Upvotes: 2
Views: 1855
Reputation: 8152
Well, I've been reading and searching through a lot of threads about what you're asking, but with no luck. So, I had to write my own Lambda function.
The following code snippet iterates over all top-level folders, then iterates over their subfolders and checks whether the subfolder name == Audit; if it does, it sorts the objects by last-modified time and prints the newest one.
Be aware that this code fits your structure only, since the list_folders function returns only the first level of subfolders (see the sketch below). If your structure changes to something like this:
bucket/tableName1/Audit/Audit1/audit.json
the Lambda won't work.
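For reference, here is a minimal sketch of why only one level comes back, assuming a hypothetical bucket name. With Delimiter='/', S3 collapses everything past the next '/' after the prefix into CommonPrefixes, so only the immediate "subfolders" are returned:

import boto3

s3_client = boto3.client("s3")

# With Delimiter='/', S3 groups keys at the next '/' after the prefix,
# so only one folder level comes back as CommonPrefixes.
response = s3_client.list_objects_v2(
    Bucket='my-bucket',      # hypothetical bucket name
    Prefix='tableName1/',
    Delimiter='/'
)
for cp in response.get('CommonPrefixes', []):
    print(cp['Prefix'])      # e.g. tableName1/Archive/, tableName1/Audit/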
Code snippet:
import boto3

# bucket name
bucket_name = 'Bucket Name'

# bucket resource
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket_name)

# bucket client
s3_client = boto3.client("s3")

# sort key: an object's last-modified timestamp
get_last_modified = lambda obj: obj.last_modified

# get subfolders - 1 LEVEL ONLY!
def list_folders(s3_client, bucket_name, prefix):
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix, Delimiter='/')
    for content in response.get('CommonPrefixes', []):
        yield content.get('Prefix')

def lambda_handler(event, context):
    # get all top-level folders
    folder_list = list_folders(s3_client, bucket_name, '')
    for folder in folder_list:
        # get all subfolders of the current folder
        subfolders = list_folders(s3_client, bucket_name, folder)
        for subfolder in subfolders:
            # check if the subfolder name equals Audit
            if 'Audit' == subfolder.split('/')[1]:
                # get all objects under the subfolder
                objs = list(bucket.objects.filter(Prefix=subfolder))
                # sort by last modified and take the newest object
                last_modified_file = sorted(objs, key=get_last_modified)[-1]
                # print results
                print('Last modified file Name: %s ---- Date: %s' % (last_modified_file.key, last_modified_file.last_modified))
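If you want the list-of-dicts result from your question instead of printed lines, a minimal adaptation reusing the same helpers (a sketch, not tested against your bucket) could collect the results like this:

def collect_latest_audit_files():
    results = []
    for folder in list_folders(s3_client, bucket_name, ''):
        for subfolder in list_folders(s3_client, bucket_name, folder):
            if subfolder.split('/')[1] == 'Audit':
                objs = list(bucket.objects.filter(Prefix=subfolder))
                if not objs:
                    continue  # skip empty Audit folders
                newest = max(objs, key=get_last_modified)
                table_name = subfolder.split('/')[0]
                # keep only the file name, as in the desired output
                results.append({table_name: newest.key.split('/')[-1]})
    return results

# expected shape: [{'tableName1': 'auditN.json'}, {'tableName2': 'auditN.json'}]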
Tested against a set of files including a Table2 folder with a subfolder named Archive (file listing and output screenshots omitted).
Hope you will find it helpful.
Upvotes: 1