Performance of AWS EMR over S3 compared to a server with hard disk storage

Question

We have around 10 TB of customer data that we need to load and query using Hive. From it we create aggregation tables, which in turn have to be queried multiple times.

I am planning to store the 10 TB of data in a single AWS S3 bucket and query it using EMR.

Is this a feasible approach, or will the performance be poor?

What alternatives can be used to speed up the queries?

jarmod · Accepted Answer

Yes, it's feasible. Querying data directly in S3, rather than first hydrating HDFS, is a very common use case. The challenge with giving a definitive statement on performance is that "it depends". I think performance per dollar is undeniably better with S3, but raw performance is likely to be better with the data stored locally (as you'd expect). How much better depends on how you organize the data and what your interaction with that data looks like.
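To make "how you organize data" a little more concrete, here is a minimal HiveQL sketch of one possible layout, not a recommendation for your specific data. It assumes the raw files are comma-delimited text laid out under date-partitioned prefixes. The bucket, paths, table names, and columns are all hypothetical placeholders.

```sql
-- Hypothetical names throughout: bucket, paths, tables and columns are placeholders.
-- External table over the raw files, read in place from S3 (no copy into HDFS).
CREATE EXTERNAL TABLE IF NOT EXISTS events_raw (
  customer_id BIGINT,
  amount      DOUBLE
)
PARTITIONED BY (event_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://example-bucket/raw/events/';

-- Register partition directories (event_date=.../) that already exist under the location.
MSCK REPAIR TABLE events_raw;

-- Build the aggregation once, in a columnar format, so repeated queries scan less data.
CREATE TABLE events_daily_agg
STORED AS ORC
AS
SELECT event_date,
       customer_id,
       SUM(amount) AS total_amount,
       COUNT(*)    AS event_count
FROM events_raw
GROUP BY event_date, customer_id;
```

The idea is that the large raw data set stays in S3, while the much smaller aggregation table, which is the one queried repeatedly, is stored in a compact columnar format. Where that table should live (cluster-local HDFS or S3) depends on whether the cluster is long-running, since HDFS contents are lost when the cluster is terminated.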

Here are some related articles on this topic:

Things to consider when optimizing access to data in S3:
