Reputation: 1
*Disclaimer:* This is my first time ever posting on Stack Overflow, so excuse me if this is not the place for such a high-level question.
I just started working as a data scientist and I've been asked to set up an AWS environment for 'external' data. This data comes from different sources, in different formats (although it's mostly CSV/XLSX). They want to store it on AWS and be able to query/visualize it with Tableau.
Despite my lack of AWS experience, I managed to come up with a solution that's more or less working. This is my approach:
It works, but it feels like a messy solution: the queries are slow and the Lambdas are huge. The data is often not as normalized as it could be, since normalizing it increases query time even more. Storing everything as CSV also seems silly.
I've tried to read up on best practices, but it's a bit overwhelming. I've got plenty of questions, but they boil down to: what services should I be using in a situation like this? What does the high-level architecture look like?
Upvotes: 0
Views: 351
Reputation: 1039
I have a fairly similar use case; however, it all comes down to the size of the project and how far you want to take the robustness / future-proofing of the solution.
As a first iteration, what you have described above seems like it works and is a reasonable approach, but as you pointed out it is quite basic and clunky. If the external data is something you will be consistently ingesting and can foresee growing, I would strongly suggest you design a data lake system first. My recommendation would be to either use the AWS Lake Formation service or, if you want more control and to build from the ground up, use something like the 3x3x3 approach.
By designing your data lake correctly, managing the data in the future becomes much simpler, and your files end up nicely partitioned for future use / data diving.
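To make that concrete, here is a rough sketch of what a layered, partitioned S3 layout could look like. The zone names, bucket name, and partition scheme are my own assumptions for illustration, not a fixed standard:

```python
# Hypothetical sketch of a layered data-lake layout on S3. The zone names,
# bucket name, and partition scheme below are assumptions, not a standard.
import datetime

import boto3

s3 = boto3.client("s3")

BUCKET = "my-company-datalake"   # assumed bucket name
TODAY = datetime.date.today()


def landing_key(source: str, filename: str) -> str:
    """Raw files land exactly as received, partitioned by source and date."""
    return (f"landing/source={source}/year={TODAY.year}/"
            f"month={TODAY.month:02d}/day={TODAY.day:02d}/{filename}")


def curated_key(dataset: str, filename: str) -> str:
    """Cleaned, typed data (e.g. Parquet) that Athena / Tableau will query."""
    return (f"curated/dataset={dataset}/year={TODAY.year}/"
            f"month={TODAY.month:02d}/{filename}")


# Example: drop an incoming vendor CSV into the landing zone as-is.
s3.upload_file("orders.csv", BUCKET, landing_key("vendor_a", "orders.csv"))
```

The point is that every file's location encodes where it came from, how processed it is, and when it arrived, so later jobs (and Athena partitions) can target exactly the slice they need.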
A high-level architecture would be something like:
then,
I would also recommend converting your files into Parquet if you're using Athena or an equivalent, for improved query speed. Remember, file partitioning is important to performance!
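As an idea of what that conversion could look like, here is a minimal sketch using pandas and pyarrow. The paths and the partition column are made-up examples, and reading/writing `s3://` paths assumes the s3fs package is installed:

```python
# Minimal sketch: turn an incoming CSV into partitioned Parquet so Athena
# scans less data. Paths and the partition column are made-up examples.
import pandas as pd  # s3:// paths also require the s3fs package

df = pd.read_csv("s3://my-company-datalake/landing/source=vendor_a/orders.csv")

# Derive a partition column (here, the ingestion month) before writing.
df["ingest_month"] = pd.Timestamp.now(tz="UTC").strftime("%Y-%m")

df.to_parquet(
    "s3://my-company-datalake/curated/dataset=orders/",
    engine="pyarrow",
    partition_cols=["ingest_month"],  # written as ingest_month=YYYY-MM/ prefixes
    index=False,
)
```

With that layout, Athena can prune partitions via a `WHERE` filter on the partition column instead of scanning every file.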
Note, the above is for quite a robust ingestion system and may be overkill if you have a basic use case with infrequent data ingestion.
If your data arrives in small packets but very frequently, you could even use a Kinesis layer in front of the Lambda-to-S3 step to pipe your data in a more organised manner. You could also use Redshift to host your data instead of S3 if you wanted a more contemporary warehouse solution. However, if you have x sources I would suggest sticking with S3 for simplicity.
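For illustration, the producer side of such a Kinesis layer might look roughly like this. The stream name and record shape are assumptions; a Lambda consumer (or Kinesis Data Firehose) would then batch the records into S3:

```python
# Rough sketch of pushing incoming records onto a Kinesis stream instead of
# writing straight to S3. Stream name and payload shape are assumptions.
import json

import boto3

kinesis = boto3.client("kinesis")


def send_record(payload: dict, source: str) -> None:
    kinesis.put_record(
        StreamName="external-data-ingest",      # assumed stream name
        Data=json.dumps(payload).encode("utf-8"),
        PartitionKey=source,                    # spreads records across shards
    )


send_record({"order_id": 123, "amount": 9.99}, source="vendor_a")
```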
Upvotes: 1