SourceSimian

Reputation: 730

Does EMRFS make S3 consistent for external clients?

If I have a file in HDFS or on the local filesystem, is it possible to copy it to S3 with EMRFS enabled, immediately shut down the cluster, and have the file be guaranteed available for both listing and reading to external readers as soon as the copy operation completes? Or is EMRFS only consistent within the specific EMR cluster it was enabled for? What would copying a file to S3 via EMRFS from HDFS look like? From the local filesystem?

Upvotes: 1

Views: 1646

Answers (2)

loneStar

Reputation: 4010

EMRFS consistent view applies to objects that are created by EMR Hadoop jobs.

1) The main purpose of EMRFS consistent view is to make objects created by Hadoop jobs immediately consistent, so the files can be used by the next job if there is a dependency.

2) To copy files to S3 so that they are immediately consistent for the next Hadoop jobs, copy the file to HDFS and then copy it to S3 using the following commands:

  1. hdfs dfs -put file.txt /user/hadoop/
  2. hdfs dfs -cp /user/hadoop/file.txt s3://bucket-name
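The question also asks about copying straight from the local filesystem. A minimal sketch, assuming it runs on an EMR node where the Hadoop filesystem shell accepts `s3://` destinations through EMRFS. The function names and the `HDFS_CMD` override are hypothetical and exist only for illustration and dry runs:

```shell
# Sketch only: thin wrappers around the Hadoop filesystem shell.
# HDFS_CMD is a hypothetical override so the commands can be dry-run
# (for example HDFS_CMD=echo); by default the real hdfs client is used.

# Copy a local file directly to S3 (assumes EMRFS handles the s3:// scheme).
copy_local_to_s3() {
  # $1: local file, $2: destination such as s3://bucket-name/
  "${HDFS_CMD:-hdfs}" dfs -put "$1" "$2"
}

# Copy a file that already lives in HDFS to S3.
copy_hdfs_to_s3() {
  # $1: HDFS path, $2: destination such as s3://bucket-name/
  "${HDFS_CMD:-hdfs}" dfs -cp "$1" "$2"
}
```

For example, `copy_local_to_s3 file.txt s3://bucket-name/` would skip the intermediate HDFS step. Whether that also records the object in the EMRFS metadata depends on consistent view being enabled on the cluster.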

Listing files in S3 is very costly. If you want data to be immediately consistent on S3, you have to implement an index for S3. The following post describes how to make files immediately consistent using DynamoDB: https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/?

When you use the S3 API directly, plain S3 consistency applies, because the request does not go through the EMRFS metadata.

Upvotes: 1

SourceSimian

Reputation: 730

I asked this same question on the AWS developer forum: https://forums.aws.amazon.com/thread.jspa?threadID=257220&tstart=25. The thread contains a lot of valuable detail and, IMO, provides a much better overview of EMRFS than all the EMRFS documentation combined, but I will provide a crash summary of the crash summary:

1) Consistent view is a feature which must be explicitly enabled in the EMRFS configuration, otherwise you only have S3 consistency guarantees.

2) EMRFS Consistent View only takes effect within clusters which share the same EMRFS configuration. It has no effect on external clients accessing S3 normally.

3) The only real consistency guarantee S3 provides is that a new file that has not been written before is guaranteed consistent for reads, but not for listing. So if a client specifically asks for a file by path that it knows has been newly-created, it will always get it, but it may or may not get the path of the file in a list operation, and if the file previously existed there is no guarantee which version the client will get on a read operation.
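Given point 3, an external client that depends on a list operation may need to retry until the new key shows up. A rough sketch of such a retry loop in plain POSIX shell. `wait_until` is a hypothetical helper, not part of any AWS tooling, and the probe command is whatever check the client already uses:

```shell
# Sketch only: retry a command until it succeeds or attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Note: POSIX sh has no local variables, so attempts, delay and i are global.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Run the probe; stop as soon as it reports success.
    if "$@"; then
      return 0
    fi
    # Pause between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

An external reader could then run something like `wait_until 10 5 aws s3 ls s3://bucket-name/file.txt`, assuming the probe exits non-zero while the key is not yet listed. This only works around the listing delay; it does not address the overwrite case, where a read may still return an older version.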

Upvotes: 1
