Reputation: 2167
We have around 10 TB of customer data that we need to load and query using Hive. We also need to create aggregation tables, which will in turn be queried multiple times.
I am planning to use AWS S3 to store the 10 TB of data in one bucket and query the data using EMR.
Is this a feasible approach, or will the performance be poor?
What alternatives can be used to speed up the query?
Upvotes: 1
Views: 629
Reputation: 78573
Yes, it's feasible. Querying data directly in S3, rather than hydrating HDFS first, is a very common use case. The challenge with giving a definitive statement on performance is that "it depends". I think performance per dollar is undeniably better with S3, but raw performance is likely to be better with the data held locally in HDFS (as you'd expect), depending on how you organize the data and what your interaction with it looks like.
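To make "how you organize data" a bit more concrete, here is a minimal sketch of one common layout: an external Hive table stored as ORC and partitioned by date, with each partition under its own S3 prefix, so that queries filtering on the partition column only need to read the matching prefixes. The bucket name `example-bucket`, the table name `events`, the columns, and the `dt` partition key are all hypothetical; the Python helpers only build the HiveQL strings and are not part of any library.

```python
from datetime import date
from typing import Mapping


def partition_prefix(bucket: str, table: str, day: date) -> str:
    """Return a Hive-style partition prefix such as s3://bucket/table/dt=2020-01-31/."""
    return f"s3://{bucket}/{table}/dt={day.isoformat()}/"


def external_table_ddl(table: str, bucket: str, columns: Mapping[str, str]) -> str:
    """Build HiveQL for an external, date-partitioned ORC table rooted in S3."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "PARTITIONED BY (dt STRING)\n"
        "STORED AS ORC\n"
        f"LOCATION 's3://{bucket}/{table}/';"
    )


def add_partition_ddl(table: str, bucket: str, day: date) -> str:
    """Build HiveQL that registers one day's S3 prefix as a partition."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION (dt='{day.isoformat()}') "
        f"LOCATION '{partition_prefix(bucket, table, day)}';"
    )


if __name__ == "__main__":
    # Hypothetical schema, for illustration only.
    schema = {"customer_id": "BIGINT", "amount": "DOUBLE"}
    print(external_table_ddl("events", "example-bucket", schema))
    print(add_partition_ddl("events", "example-bucket", date(2020, 1, 31)))
```

You would run the generated statements in Hive on the EMR cluster. Whether a date partition is the right choice depends on how your queries filter the data, so treat the key as a placeholder for whatever column your aggregations slice by most often.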
Here are some related articles on this topic:
Things to consider when optimizing access to data in S3:
Upvotes: 3