Reputation: 979
I am in the process of moving an internal company tool, written entirely in Python, to the AWS ecosystem, but I'm having trouble figuring out the proper way to set up my data so that it stays organized. This tool is used by people throughout the company, with each person running it on their own datasets (which vary from a few megabytes to a few gigabytes in size). Currently, users clone the code to their local machines and then run the tool on their data locally; we are now trying to move this usage to the cloud.
For a single person, it is simple enough to have them upload their data to S3 and then point the Python code at that data to run the tool, but I'm worried that as more people start using the tool, the S3 storage will become cluttered and disorganized.
Additionally, each person might make slight changes to the Python tool in order to do custom work on their data. Our code is hosted on a Bitbucket server, and users will be forking the repo for their custom work.
My questions are: how should we organize each user's data in S3 so that it stays manageable as more people use the tool, and how should we handle users running their own customized forks of the code in the cloud?
If anyone has any input as to how to set up this project, or has links to any relevant guides/documents, it would be greatly appreciated. Thanks!
Upvotes: 0
Views: 74
Reputation: 67988
You can do something like this:
a) A boto3 script that uploads the data to a specified S3 bucket, with a timestamp appended to the key (a sketch is shown after this list).
b) Configure the S3 bucket to send a notification over SQS whenever a new object arrives (see the second sketch below).
c) Keep 2-3 EC2 machines running that actively listen to that SQS queue.
d) When a new item arrives, a worker gets the object key from SQS and processes it, deleting the message from SQS after successful completion.
e) Put the processed data somewhere, delete the key from the bucket, and notify the user by mail (the worker sketch after this list covers d and e).
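A minimal sketch of the upload script in (a), assuming a bucket named company-tool-data and a per-user uploads/ prefix to keep things organized (both are placeholders, not anything prescribed by AWS):

    import datetime
    import pathlib

    import boto3

    def upload_dataset(local_path, bucket, user):
        """Upload a local file to S3 under a per-user, timestamped key."""
        s3 = boto3.client("s3")
        timestamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
        key = "uploads/{}/{}/{}".format(user, timestamp, pathlib.Path(local_path).name)
        s3.upload_file(local_path, bucket, key)
        return key

    if __name__ == "__main__":
        # Placeholder bucket and user names.
        print(upload_dataset("my_data.csv", "company-tool-data", "alice"))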
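For (b), the notification can be configured once with boto3 as well; the bucket name, queue ARN, and prefix below are placeholders, and the queue's access policy must already allow S3 to publish to it:

    import boto3

    s3 = boto3.client("s3")

    # Send "object created" events under the uploads/ prefix to an existing SQS queue.
    s3.put_bucket_notification_configuration(
        Bucket="company-tool-data",
        NotificationConfiguration={
            "QueueConfigurations": [
                {
                    "QueueArn": "arn:aws:sqs:us-east-1:123456789012:company-tool-jobs",
                    "Events": ["s3:ObjectCreated:*"],
                    "Filter": {
                        "Key": {"FilterRules": [{"Name": "prefix", "Value": "uploads/"}]}
                    },
                }
            ]
        },
    )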
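A rough sketch of the worker loop in (c)-(e); the queue URL, results prefix, and the process() stand-in are all placeholders for your actual setup:

    import json
    import shutil
    import urllib.parse

    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/company-tool-jobs"  # placeholder
    RESULTS_PREFIX = "results/"  # placeholder prefix for processed output

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")

    def process(local_path):
        """Stand-in for the real tool: here it just copies the input file."""
        output_path = local_path + ".out"
        shutil.copyfile(local_path, output_path)
        return output_path

    def main():
        while True:
            # Long-poll so the worker idles cheaply when the queue is empty.
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
            )
            for msg in resp.get("Messages", []):
                body = json.loads(msg["Body"])
                for record in body.get("Records", []):  # skips S3's initial test event
                    bucket = record["s3"]["bucket"]["name"]
                    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

                    local_path = "/tmp/" + key.split("/")[-1]
                    s3.download_file(bucket, key, local_path)
                    output_path = process(local_path)

                    # Store the result, remove the input, then acknowledge the message.
                    s3.upload_file(output_path, bucket, RESULTS_PREFIX + key.split("/")[-1])
                    s3.delete_object(Bucket=bucket, Key=key)
                    # Mailing the user (e.g. via SES) would go here.
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    if __name__ == "__main__":
        main()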
For users doing custom work, they can create a new branch and include its name in the uploaded data; the EC2 worker reads it from there and checks out the required branch, and the branch can be deleted after the job. This can be a single file containing just the branch name, which only needs a one-time setup (a sketch follows below). You should also run the worker under a process manager on EC2 so the process is restarted if it crashes.
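One possible shape for that setup, assuming the uploaded job folder contains a branch.txt file and the repo is already cloned on the worker (both are assumptions, as are the paths and the default branch name):

    import subprocess
    from pathlib import Path

    REPO_DIR = "/opt/company-tool"  # placeholder: the worker's clone of the Bitbucket repo

    def checkout_branch_for_job(job_dir):
        """Check out the branch named in the job's branch.txt, or fall back to the default branch."""
        branch_file = Path(job_dir) / "branch.txt"  # assumed convention: one line holding the branch name
        branch = branch_file.read_text().strip() if branch_file.exists() else "master"

        # Update the worker's clone and switch it to the requested branch before running the tool.
        subprocess.run(["git", "-C", REPO_DIR, "fetch", "origin"], check=True)
        subprocess.run(["git", "-C", REPO_DIR, "checkout", branch], check=True)
        subprocess.run(["git", "-C", REPO_DIR, "pull", "origin", branch], check=True)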
Upvotes: 1