I'm new to AWS Lambda and am trying to create a Lambda function that is invoked by an S3 put event, applies some business logic to the incoming data, then loads the result into a target.
For example, a new file (contactemail.json) created in the source S3 bucket contains two fields: email and domain. Another, persisted file (lkp.json) in the same S3 bucket contains a list of all free email domains (e.g. gmail.com). The Lambda function reads contactemail.json and looks its domain up in lkp.json. If the domain from contactemail.json exists in lkp.json, the whole email address is put into a new field (newdomain) in contactemail.json, and the output is uploaded to the target S3 bucket.
The following is my code. It does the work; however, as you can see, I use s3_client.download_file to download the lkp.json file before performing the lookup.
My concern is that if the lookup file becomes too big, the download may take too long and cause the Lambda function to time out.
Is there a better/smarter way to do the lookup without having to download the lookup file from S3 to Lambda?
from __future__ import print_function
import boto3
import os
import sys
import uuid
import json

s3_client = boto3.client('s3')

def handler(event, context):
    # get source details from the event
    for record in event['Records']:
        sourcebucket = record['s3']['bucket']['name']
        sourcekey = record['s3']['object']['key']
        sourcefilename = sourcekey[sourcekey.rfind('/')+1:]
        lookupkey = 'json/contact/lkp/lkp.json'
        lookupfilename = 'lkp.json'

        # set target based on source values
        targetbucket = sourcebucket + 'resized'
        targetkey = sourcekey
        targetfilename = sourcefilename

        # set download and upload paths in Lambda's /tmp
        download_path = '/tmp/{}{}'.format(uuid.uuid4(), sourcefilename)
        download_path_lkp = '/tmp/{}{}'.format(uuid.uuid4(), lookupfilename)
        upload_path = '/tmp/{}{}'.format(uuid.uuid4(), targetfilename)

        # download source and lookup files
        s3_client.download_file(sourcebucket, sourcekey, download_path)
        s3_client.download_file(sourcebucket, lookupkey, download_path_lkp)

        #if not os.path.exists(upload_path):
        #    open(upload_path, 'w').close()
        targetfile = open(upload_path, 'w')
        sourcefile = json.loads(open(download_path).read())
        lookupfile = json.loads(open(download_path_lkp).read())

        # build the list of free email domains
        lookuplist = []
        for row in lookupfile:
            lookuplist.append(row["domain"])

        # write the enriched records as a JSON array
        targetfile.write('[')
        firstrow = True
        for row in sourcefile:
            email = row["email"]
            emaildomain = email[email.rfind('@')+1:]
            if emaildomain in lookuplist:
                row["newdomain"] = email
            else:
                row["newdomain"] = emaildomain
            if not firstrow:
                targetfile.write(',\n')
            else:
                firstrow = False
            json.dump(row, targetfile)
        targetfile.write(']')
        targetfile.close()

        # upload to target
        s3_client.upload_file(upload_path, targetbucket, targetkey)
Simply stated, S3 is not the correct service to use for this purpose.
It is not possible to look inside an object stored in S3 without downloading it.¹
Objects are the atomic entity in S3 -- there is nothing S3 understands that is smaller than an object, such as a "record" inside an object.
It is also not possible to append data to an object in S3. You have to download it, modify it, and upload it again, and if more than one process attempts this in parallel, at least one process will silently lose data, because there is no way to lock an S3 object FOR UPDATE (a little SQL lingo, there). The second process reads the original object, modifies it, and proceeds to overwrite the changes that the first process saved right after the second process read the object.
As a "think outside the box" person, I will be the first to assert that there is a valid use case for S3 as a simple, perfunctory NoSQL database -- it is, after all, a key/value store with infinite storage and fast lookups by key... but the applications where it is suited to this role are limited. That isn't what it was designed for.
It seems in this case that you would be better served by a different architecture... although, if you connect your Lambda function to a VPC and create a VPC endpoint for S3, or use a NAT instance (not a NAT gateway, which has bandwidth charges), you could do 100,000 downloads for $0.04, so depending on your scale, downloading the file repeatedly might not be the worst thing ever... but you are going to waste a lot of billable Lambda milliseconds repeatedly parsing and scanning the same file, and as you already know, this will only get slower as your application grows. It seems like RDS, DynamoDB, or SimpleDB might be a better fit here.
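For example, if the free-domain list were loaded into a DynamoDB table keyed by domain, each lookup becomes a single GetItem call whose cost does not grow with the size of the list. A minimal sketch, assuming a hypothetical table named free_email_domains with domain as its partition key:

import boto3

dynamodb = boto3.resource('dynamodb')
domain_table = dynamodb.Table('free_email_domains')  # assumed table name

def is_free_domain(domain):
    # GetItem is a keyed lookup; 'Item' appears in the response only when the domain exists.
    response = domain_table.get_item(Key={'domain': domain})
    return 'Item' in response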
You could also cache the content, or at least the specific lookup results, in memory, in an object outside the scope of "handler"... right? (Not a Python person, but it seems plausible.) Lambda will reuse the same process some of the time, depending on the workload and frequency of invocation.
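In Python that looks like keeping the parsed lookup data in a module-level variable: anything assigned outside handler lives for the lifetime of the container, so warm invocations skip both the download and the parse. A sketch along those lines, reusing the bucket/key layout from the question:

import json
import boto3

s3_client = boto3.client('s3')
_lookup_domains = None  # populated on the first (cold) invocation, reused while the container stays warm

def get_lookup_domains(bucket, key='json/contact/lkp/lkp.json'):
    global _lookup_domains
    if _lookup_domains is None:
        # Cold start: fetch and parse lkp.json once, keep only the set of domains in memory.
        body = s3_client.get_object(Bucket=bucket, Key=key)['Body'].read()
        _lookup_domains = {row['domain'] for row in json.loads(body)}
    return _lookup_domains

Using a set instead of a list here also makes each "in" check constant-time, rather than scanning the whole list for every email.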
¹ Yes, you can do a byte-range read without downloading the entire object, but that isn't applicable here, since we need to scan, not seek.