Reputation: 979
I am in the process of moving an internal company tool, written entirely in Python, to the AWS ecosystem, but I'm having trouble figuring out the proper way to set up my data so that it stays organized. This tool is used by people throughout the company, with each person running it on their own datasets (which vary from a few megabytes to a few gigabytes in size). Currently, users clone the code to their local machines and then run the tool on their data locally; we are now trying to move this usage to the cloud.
For a single person, it is simple enough to have them upload their data to S3 and then point the Python code at that data to run the tool, but I'm worried that as more people start using the tool, the S3 storage will become cluttered and disorganized.
Additionally, each person might make slight changes to the Python tool in order to do custom work on their data. Our code is hosted on a Bitbucket server, and users will be forking the repo for their custom work.
My questions are: how should we organize each user's data in S3 so that it stays manageable as more people use the tool, and how should we handle users running their own customized forks of the code in the cloud?
If anyone has any input as to how to set up this project, or has links to any relevant guides/documents, it would be greatly appreciated. Thanks!
Upvotes: 0
Views: 74
Reputation: 67988
You can do something like this:
a) A boto3 script that uploads the data to a specified S3 bucket, with a timestamp appended to the key (a sketch is shown after this list).
b) Configure the S3 bucket to send a notification over SQS whenever a new object arrives (see the second sketch below).
c) Keep 2-3 EC2 machines running that actively listen to that SQS queue.
d) When a new item arrives, a worker gets the object key from SQS and processes it, deleting the message from SQS after successful completion.
e) Put the processed data somewhere, delete the key from the bucket, and notify the user by mail (the worker sketch after this list covers d and e).
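A minimal sketch of the upload script in (a), assuming a bucket named company-tool-data and a per-user uploads/ prefix to keep things organized (both are placeholders, not anything prescribed by AWS):

    import datetime
    import pathlib

    import boto3

    def upload_dataset(local_path, bucket, user):
        """Upload a local file to S3 under a per-user, timestamped key."""
        s3 = boto3.client("s3")
        timestamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
        key = "uploads/{}/{}/{}".format(user, timestamp, pathlib.Path(local_path).name)
        s3.upload_file(local_path, bucket, key)
        return key

    if __name__ == "__main__":
        # Placeholder bucket and user names.
        print(upload_dataset("my_data.csv", "company-tool-data", "alice"))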
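For (b), the notification can be configured once with boto3 as well; the bucket name, queue ARN, and prefix below are placeholders, and the queue's access policy must already allow S3 to publish to it:

    import boto3

    s3 = boto3.client("s3")

    # Send "object created" events under the uploads/ prefix to an existing SQS queue.
    s3.put_bucket_notification_configuration(
        Bucket="company-tool-data",
        NotificationConfiguration={
            "QueueConfigurations": [
                {
                    "QueueArn": "arn:aws:sqs:us-east-1:123456789012:company-tool-jobs",
                    "Events": ["s3:ObjectCreated:*"],
                    "Filter": {
                        "Key": {"FilterRules": [{"Name": "prefix", "Value": "uploads/"}]}
                    },
                }
            ]
        },
    )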
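A rough sketch of the worker loop in (c)-(e); the queue URL, results prefix, and the process() stand-in are all placeholders for your actual setup:

    import json
    import shutil
    import urllib.parse

    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/company-tool-jobs"  # placeholder
    RESULTS_PREFIX = "results/"  # placeholder prefix for processed output

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")

    def process(local_path):
        """Stand-in for the real tool: here it just copies the input file."""
        output_path = local_path + ".out"
        shutil.copyfile(local_path, output_path)
        return output_path

    def main():
        while True:
            # Long-poll so the worker idles cheaply when the queue is empty.
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
            )
            for msg in resp.get("Messages", []):
                body = json.loads(msg["Body"])
                for record in body.get("Records", []):  # skips S3's initial test event
                    bucket = record["s3"]["bucket"]["name"]
                    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

                    local_path = "/tmp/" + key.split("/")[-1]
                    s3.download_file(bucket, key, local_path)
                    output_path = process(local_path)

                    # Store the result, remove the input, then acknowledge the message.
                    s3.upload_file(output_path, bucket, RESULTS_PREFIX + key.split("/")[-1])
                    s3.delete_object(Bucket=bucket, Key=key)
                    # Mailing the user (e.g. via SES) would go here.
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    if __name__ == "__main__":
        main()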
For users doing custom work, they can create a new branch and include its name in the uploaded data; the EC2 worker reads it from there and checks out the required branch, and the branch can be deleted after the job. This can be a single file containing just the branch name, which only needs a one-time setup (a sketch follows below). You should also run the worker under a process manager on EC2 so the process is restarted if it crashes.
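One possible shape for that setup, assuming the uploaded job folder contains a branch.txt file and the repo is already cloned on the worker (both are assumptions, as are the paths and the default branch name):

    import subprocess
    from pathlib import Path

    REPO_DIR = "/opt/company-tool"  # placeholder: the worker's clone of the Bitbucket repo

    def checkout_branch_for_job(job_dir):
        """Check out the branch named in the job's branch.txt, or fall back to the default branch."""
        branch_file = Path(job_dir) / "branch.txt"  # assumed convention: one line holding the branch name
        branch = branch_file.read_text().strip() if branch_file.exists() else "master"

        # Update the worker's clone and switch it to the requested branch before running the tool.
        subprocess.run(["git", "-C", REPO_DIR, "fetch", "origin"], check=True)
        subprocess.run(["git", "-C", REPO_DIR, "checkout", branch], check=True)
        subprocess.run(["git", "-C", REPO_DIR, "pull", "origin", branch], check=True)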
Upvotes: 1