Reputation: 1
*Disclaimer:* This is my first time ever posting on Stack Overflow, so excuse me if this is not the place for such a high-level question.
I just started working as a data scientist and I've been asked to set up an AWS environment for 'external' data. This data comes from different sources, in different formats (although it's mostly CSV/XLSX). They want to store it on AWS and be able to query/visualize it with Tableau.
Despite my lack of AWS experience, I managed to come up with a solution that's more or less working. This is my approach:
It works, but it feels like a messy solution: the queries are slow and the Lambdas are huge. The data is often not as normalized as it could be, since normalizing it increases query time even more. Storing everything as CSV also seems silly.
I've tried to read up on best practices, but it's a bit overwhelming. I've got plenty of questions, but they boil down to: what services should I be using in a situation like this? What does the high-level architecture look like?
Upvotes: 0
Views: 351
Reputation: 1039
I have a fairly similar use case; however, it all comes down to the size of the project and how far you want to take the robustness / future-proofing of the solution.
As a first iteration, what you have described above seems like it works and is a reasonable approach, but as you pointed out it is quite basic and clunky. If the external data is something you will be consistently ingesting and can foresee growing, I would strongly suggest you design a data lake system first. My recommendation would be to either use the AWS Lake Formation service or, if you want more control and to build from the ground up, use something like the 3x3x3 approach.
By designing your data lake correctly, managing the data in the future becomes much simpler, and your files end up nicely partitioned for future use / data diving.
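To make that concrete, here is a rough sketch of what a layered, partitioned S3 layout could look like. The zone names, bucket name, and partition scheme are my own assumptions for illustration, not a fixed standard:

```python
# Hypothetical sketch of a layered data-lake layout on S3. The zone names,
# bucket name, and partition scheme below are assumptions, not a standard.
import datetime

import boto3

s3 = boto3.client("s3")

BUCKET = "my-company-datalake"   # assumed bucket name
TODAY = datetime.date.today()


def landing_key(source: str, filename: str) -> str:
    """Raw files land exactly as received, partitioned by source and date."""
    return (f"landing/source={source}/year={TODAY.year}/"
            f"month={TODAY.month:02d}/day={TODAY.day:02d}/{filename}")


def curated_key(dataset: str, filename: str) -> str:
    """Cleaned, typed data (e.g. Parquet) that Athena / Tableau will query."""
    return (f"curated/dataset={dataset}/year={TODAY.year}/"
            f"month={TODAY.month:02d}/{filename}")


# Example: drop an incoming vendor CSV into the landing zone as-is.
s3.upload_file("orders.csv", BUCKET, landing_key("vendor_a", "orders.csv"))
```

The point is that every file's location encodes where it came from, how processed it is, and when it arrived, so later jobs (and Athena partitions) can target exactly the slice they need.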
A high-level architecture would be something like:
then,
I would also recommend converting your files into Parquet if you're using Athena or an equivalent, for improved query speed. Remember, file partitioning is important to performance!
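As an idea of what that conversion could look like, here is a minimal sketch using pandas and pyarrow. The paths and the partition column are made-up examples, and reading/writing `s3://` paths assumes the s3fs package is installed:

```python
# Minimal sketch: turn an incoming CSV into partitioned Parquet so Athena
# scans less data. Paths and the partition column are made-up examples.
import pandas as pd  # s3:// paths also require the s3fs package

df = pd.read_csv("s3://my-company-datalake/landing/source=vendor_a/orders.csv")

# Derive a partition column (here, the ingestion month) before writing.
df["ingest_month"] = pd.Timestamp.now(tz="UTC").strftime("%Y-%m")

df.to_parquet(
    "s3://my-company-datalake/curated/dataset=orders/",
    engine="pyarrow",
    partition_cols=["ingest_month"],  # written as ingest_month=YYYY-MM/ prefixes
    index=False,
)
```

With that layout, Athena can prune partitions via a `WHERE` filter on the partition column instead of scanning every file.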
Note, the above is for quite a robust ingestion system and may be overkill if you have a basic use case with infrequent data ingestion.
If your data arrives in small packets but very frequently, you could even use a Kinesis layer in front of the Lambda-to-S3 step to pipe your data in a more organised manner. You could also use Redshift to host your data instead of S3 if you wanted a more contemporary warehouse solution. However, if you have x sources I would suggest sticking with S3 for simplicity.
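For illustration, the producer side of such a Kinesis layer might look roughly like this. The stream name and record shape are assumptions; a Lambda consumer (or Kinesis Data Firehose) would then batch the records into S3:

```python
# Rough sketch of pushing incoming records onto a Kinesis stream instead of
# writing straight to S3. Stream name and payload shape are assumptions.
import json

import boto3

kinesis = boto3.client("kinesis")


def send_record(payload: dict, source: str) -> None:
    kinesis.put_record(
        StreamName="external-data-ingest",      # assumed stream name
        Data=json.dumps(payload).encode("utf-8"),
        PartitionKey=source,                    # spreads records across shards
    )


send_record({"order_id": 123, "amount": 9.99}, source="vendor_a")
```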
Upvotes: 1