Reputation: 525
For the majority of use cases, Spark transformations can be applied to streaming data or to bounded data (say, from Amazon S3) using Amazon EMR, and the transformed data can then be written back to S3.
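To make that concrete, here is a minimal PySpark sketch of the EMR flow I have in mind; the bucket names, columns, and join key are hypothetical placeholders, not taken from any real setup:

```python
# Minimal PySpark sketch: read from S3, transform, write back to S3 on EMR.
# Bucket names, schemas, and column names below are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("s3-transform").getOrCreate()

# Read bounded data from S3 (for streaming/unbounded data, spark.readStream would be used instead).
orders = spark.read.json("s3://my-input-bucket/orders/")
customers = spark.read.parquet("s3://my-input-bucket/customers/")

# A join-plus-aggregation transformation of the kind mentioned in the question.
daily_revenue = (
    orders.join(customers, "customer_id")
          .groupBy("country", F.to_date("order_ts").alias("order_date"))
          .agg(F.sum("amount").alias("revenue"))
)

# Write the transformed data back to S3.
daily_revenue.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://my-output-bucket/daily_revenue/"
)
```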
The same transformations can also be achieved in Amazon Redshift by loading the different datasets from S3 into different Redshift tables and then loading the data from those tables into a final table. (With Redshift Spectrum, we can now also select and transform data directly from S3.)
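The equivalent Redshift pattern would look roughly like the following sketch, which issues the load and transform SQL through the Redshift Data API (boto3). The cluster name, tables, bucket, and IAM role are hypothetical placeholders:

```python
# Sketch of the Redshift load-then-transform pattern: COPY from S3 into a
# staging table, then INSERT the joined/aggregated result into a final table.
# Cluster, tables, bucket, and IAM role are assumptions for illustration.
import boto3

client = boto3.client("redshift-data")

statements = [
    # 1. Load raw data from S3 into a staging table.
    """COPY staging_orders
       FROM 's3://my-input-bucket/orders/'
       IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
       FORMAT AS PARQUET;""",
    # 2. Transform (join + aggregate) from the staging tables into the final table.
    """INSERT INTO daily_revenue (country, order_date, revenue)
       SELECT c.country, o.order_ts::date, SUM(o.amount)
       FROM staging_orders o
       JOIN customers c ON c.customer_id = o.customer_id
       GROUP BY 1, 2;""",
]

client.batch_execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sqls=statements,
)
```

With Redshift Spectrum, the COPY step could instead be replaced by querying an external table that points directly at the S3 data.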
With that said, I see that the transformations can be done in both EMR and Redshift, and that the Redshift loads and transformations take less development time.
So, should EMR be used mainly for use cases involving streaming/unbounded data? For which other use cases is EMR preferable? (I am aware Spark also provides core, SQL, and ML libraries.) But for plain transformations (involving joins/reducers), I don't see a use case for EMR other than streaming, when the transformation can also be achieved in Redshift.
Please provide use cases for when to use EMR transformations vs. Redshift transformations.
Upvotes: 2
Views: 3730
Reputation: 4354
In the first instance, I prefer to use Redshift for transformations because:
Sometimes EMR is a better option; I would consider it in these circumstances:
There are additional options beyond Redshift and EMR that should also be considered, for example AWS Batch and AWS Lambda.
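For instance, a small Lambda function can act as event-driven glue: triggered by an S3 PUT, it kicks off a Redshift COPY through the Data API. This is only a sketch; the cluster name, staging table, and IAM role below are hypothetical:

```python
# Sketch of a Lambda handler that loads newly arrived S3 objects into a
# Redshift staging table via the Redshift Data API. Cluster, table, and
# IAM role names are assumptions for illustration.
import boto3

redshift_data = boto3.client("redshift-data")

def handler(event, context):
    """Triggered by an S3 PUT event; issues a COPY for each new object."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        copy_sql = (
            f"COPY staging_events FROM 's3://{bucket}/{key}' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' "
            "FORMAT AS PARQUET;"
        )
        redshift_data.execute_statement(
            ClusterIdentifier="my-cluster",
            Database="dev",
            DbUser="awsuser",
            Sql=copy_sql,
        )
```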
Upvotes: 10