Amazon EMR vs Amazon Redshift

Question

For the majority of use cases, Spark transformations can be run on streaming data or bounded data (say, from Amazon S3) using Amazon EMR, and the transformed data can then be written back to S3.

The transformations can also be achieved in Amazon Redshift by loading the different data sets from S3 into separate Redshift tables, and then loading the data from those tables into a final table. (Now, with Redshift Spectrum, we can also select and transform data directly from S3.)

With that said, I see that the transformations can be done in both EMR and Redshift, with Redshift loads and transformations requiring less development time.

So, should EMR be used mainly for use cases involving streaming/unbounded data? For what other use cases is EMR preferable? I am aware that Spark also provides other libraries (core, SQL, ML), but purely for transformations (involving joins/reducers) I don't see a use case for EMR other than streaming, when the same transformations can also be achieved in Redshift.

Please provide use cases for when to use EMR transformations vs. Redshift transformations.

Jon Scott · Accepted Answer

My first choice is Redshift for transformations, because:

Development is easier: SQL rather than Spark code
Maintenance / monitoring is easier
Infrastructure costs are lower, assuming you can run the transformations during "off-peak" times.

Sometimes EMR is the better option. I would consider it in these circumstances:

When you want to have raw and transformed data both on S3, e.g. a "data lake" strategy
Complex transformations are required. Some transformations are just not possible in Redshift, such as when:
- managing complex and large JSON columns
- pivoting data dynamically (variable number of attributes)
- third-party libraries are required
Data sizes are so large that a much bigger Redshift cluster would be needed to process the transformations.

There are additional options besides Redshift and EMR, and these should also be considered. For example:

Standard Python or another scripting language to:
- create dynamic transformation SQL, which can be run in Redshift
- convert CSV to Parquet or similar
- schedule jobs (e.g. with Airflow)
Amazon Athena
- can be used with S3 (e.g. Parquet) input and output
- uses SQL (so some advantages in development time) with Presto syntax, which in some cases is more powerful than Redshift SQL
- can have significant cost benefits, as there are no permanent infrastructure costs; you pay per use.
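
To illustrate the "dynamic transformation SQL" idea, here is a minimal Python sketch that builds a pivot query for a variable list of attributes. Everything named here is an assumption for illustration only: the function build_pivot_sql, the table events, and the columns user_id, attr and val are hypothetical, and in practice the attribute list would come from a first query such as a SELECT DISTINCT on the attribute column. The generated statement uses plain MAX(CASE ...) expressions, so it should be acceptable to Redshift, but treat it as a starting point rather than a finished implementation.

```python
def quote_ident(name):
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value):
    """Single-quote an SQL string literal, escaping embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_pivot_sql(table, key_col, attr_col, value_col, attributes):
    """Build a SELECT that turns attribute rows into one column per attribute.

    All table and column names are supplied by the caller (hypothetical here).
    """
    # One MAX(CASE ...) expression per attribute found in the data.
    columns = [
        f"MAX(CASE WHEN {quote_ident(attr_col)} = {quote_literal(a)} "
        f"THEN {quote_ident(value_col)} END) AS {quote_ident(a)}"
        for a in attributes
    ]
    select_list = ",\n    ".join([quote_ident(key_col)] + columns)
    return (
        f"SELECT\n    {select_list}\n"
        f"FROM {quote_ident(table)}\n"
        f"GROUP BY {quote_ident(key_col)}"
    )


if __name__ == "__main__":
    # Hypothetical attribute list; normally the result of a SELECT DISTINCT.
    print(build_pivot_sql("events", "user_id", "attr", "val", ["color", "size"]))
```

The script only produces the SQL text. Running it against the cluster (for example from a scheduled job) is a separate step and depends on whichever database driver you already use.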

AWS Batch and AWS Lambda should also be considered.
