Pragmatic

Reputation: 3127

Using AWS EMRFS in Apache Spark hosted on EC2

If I am running Spark on EC2 (or in Kubernetes), can I use S3/EMRFS in place of HDFS? Is this production-ready, and does it read and process data from S3 in parallel?

Thanks in advance

Upvotes: 2

Views: 553

Answers (2)

stevel

Reputation: 13480

EMR uses a closed-source S3 connector, "EMRFS", with proprietary features. You don't get to see the source, can't get support for it from anyone other than AWS, and can't use it except when you run EMR. For independent apps, the open-source S3A connector (s3a:// URLs, shipped in Apache Hadoop's hadoop-aws module) is great, but it is not a full replacement for HDFS.
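As a rough illustration of using S3A outside EMR, here is a minimal PySpark sketch. The helper names `s3a_conf` and `build_session` are made up for this example, and so is the bucket in the usage comment. The hadoop-aws version and the credentials provider class are left as parameters because they depend on the Hadoop build your Spark distribution ships with; treat this as a starting point, not a verified production setup.

```python
def s3a_conf(hadoop_aws_version, credentials_provider):
    """Return Spark properties for reading s3a:// paths outside EMR.

    Both arguments are assumptions the caller must supply:
    - hadoop_aws_version should match the Hadoop version bundled with Spark.
    - credentials_provider is the fully qualified class name of an AWS
      credentials provider supported by that Hadoop version.
    """
    return {
        # Pull in the S3A connector from Maven at startup.
        "spark.jars.packages": f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}",
        # The "spark.hadoop." prefix forwards the option to the Hadoop configuration.
        "spark.hadoop.fs.s3a.aws.credentials.provider": credentials_provider,
    }


def build_session(app_name, conf):
    """Create a SparkSession with the given properties applied."""
    # Imported lazily so s3a_conf() can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (hypothetical bucket; fill in your own version and provider class):
# spark = build_session("s3a-demo", s3a_conf("<hadoop version>", "<provider class>"))
# df = spark.read.parquet("s3a://example-bucket/some/prefix/")
```

On EC2 the credentials would typically come from the instance's IAM role rather than from keys placed in the configuration.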

Upvotes: 1

Ged

Reputation: 18108

No, EMRFS is for EMR only; it is the easy way to make S3 look like part of HDFS. On EC2 you can connect to S3, but it takes more setup than on EMR. S3 is not tightly coupled to EC2. Yes, reads are parallelized, but not according to MapReduce-style data locality, where a worker runs on the same node as the data node holding its data.
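To make the parallelism point concrete, here is a simplified estimate of how many splits Spark's file-based sources would cut a set of S3 objects into. It is a sketch, not Spark's exact algorithm: it assumes splittable files, uses the documented defaults for `spark.sql.files.maxPartitionBytes` (128 MB) and `spark.sql.files.openCostInBytes` (4 MB), and ignores the later step where small splits are packed together into one partition, so it is an upper bound on read tasks. The function name `estimate_splits` is made up for this example.

```python
import math

MIB = 1024 * 1024


def estimate_splits(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MIB,
                    open_cost_bytes=4 * MIB):
    """Estimate the number of input splits for a list of file sizes in bytes.

    Data locality plays no part here: the split count depends only on
    object sizes and the configured limits, which is why reads from S3
    are still parallel even though no worker is "close" to the data.
    """
    # Each file is charged its length plus a fixed cost for opening it.
    total = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total // default_parallelism
    # Split size is capped by max_partition_bytes and floored by the open cost.
    max_split = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))
    # Each file is cut independently into chunks of at most max_split bytes.
    return sum(math.ceil(size / max_split) for size in file_sizes)


# A single 1 GiB object read with a default parallelism of 8.
print(estimate_splits([1024 * MIB], default_parallelism=8))
```

Compressed formats that cannot be split, such as gzip text files, would be read as one split per object regardless of size.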

Upvotes: 2
