y2k-shubham

Reputation: 11607

S3 Delete & HDFS to S3 Copy

As a part of my Spark pipeline, I have to perform the following tasks on EMR / S3:

  1. Delete: (Recursively) Delete all files / directories under a given S3 bucket
  2. Copy: Copy contents of a directory (subdirectories & files) to a given S3 bucket

Based on my current knowledge, Airflow doesn't provide operators / hooks for these tasks. I therefore plan to implement them as follows:

  1. Delete: Extend S3Hook to add a function that performs the equivalent of aws s3 rm on the specified S3 bucket (see the sketch after this list)
  2. Copy: Use SSHExecuteOperator to perform hadoop distcp
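
For the delete, here's a minimal sketch of what the extended hook might look like, assuming a boto3-backed S3Hook whose get_conn() returns a boto3 S3 client; the class name ExtendedS3Hook and the method name delete_prefix are my own, hypothetical names:

```python
from airflow.hooks.S3_hook import S3Hook


class ExtendedS3Hook(S3Hook):
    """S3Hook extended with a recursive delete (hypothetical)."""

    def delete_prefix(self, bucket_name, prefix=''):
        """Delete every key under `prefix` in `bucket_name`,
        mimicking `aws s3 rm --recursive`."""
        client = self.get_conn()  # assumed to be a boto3 S3 client
        paginator = client.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
            # Each page holds at most 1000 keys, which is also the
            # per-call limit for delete_objects, so no extra batching.
            keys = [{'Key': obj['Key']} for obj in page.get('Contents', [])]
            if keys:
                client.delete_objects(Bucket=bucket_name,
                                      Delete={'Objects': keys})
```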

My questions are:

  1. Does Airflow already provide operators / hooks for these tasks?
  2. Is my planned approach a good way to implement them?

I'm using:

Upvotes: 0

Views: 1494

Answers (1)

Simon D

Reputation: 6259

Well, the delete is a primitive operation, yes, but the hadoop distcp is not. To answer your questions:

  1. No, Airflow does not have functions on the S3 hook to perform these actions.
  2. Creating your own plugin to extend the s3_hook, and using the SSH operator to perform the distcp, is, in my opinion, a good way to do this (see the sketch after this list).
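
For reference, a minimal sketch of the distcp task, assuming the pre-1.10 SSHExecuteOperator and an SSH connection pointing at the EMR master node; the connection id emr_master, the HDFS/S3 paths, and the dag object are all placeholders:

```python
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_execute_operator import SSHExecuteOperator

emr_ssh = SSHHook(conn_id='emr_master')  # hypothetical connection id

copy_hdfs_to_s3 = SSHExecuteOperator(
    task_id='hdfs_to_s3_distcp',
    ssh_hook=emr_ssh,
    # placeholder paths; EMR also ships s3-dist-cp as an alternative
    bash_command='hadoop distcp hdfs:///data/output s3a://my-bucket/output',
    dag=dag,  # assumes a DAG defined elsewhere in the file
)
```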

Not sure why the standard S3_Hook does not have a delete function. It MAY be because S3 provides an "eventually consistent" consistency model (probably not the reason, but good to keep in mind anyway).

Upvotes: 1
