Reputation: 11607
As a part of my Spark pipeline, I have to perform the following tasks on EMR / S3:
S3 bucket
S3 bucket
Based on my current knowledge, Airflow doesn't provide operators / hooks for these tasks. I therefore plan to implement them as follows:
S3Hook to add a function that performs aws s3 rm on the specified S3 bucket
SSHExecuteOperator to perform hadoop distcp
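The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Airflow's own API: `delete_prefix` and `build_distcp_command` are hypothetical helper names, and `s3_client` is assumed to be a boto3 S3 client (for example from `boto3.client("s3")`, or from the hook's connection if the installed Airflow version exposes one). The delete helper approximates a recursive `aws s3 rm` over a prefix; the second helper only builds the shell command string that an SSH-based operator could run on the EMR master node.

```python
import shlex


def delete_prefix(s3_client, bucket, prefix):
    """Delete every object under `prefix` in `bucket`; return how many were deleted.

    `s3_client` is assumed to behave like a boto3 S3 client.
    """
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matching objects has no "Contents" entry.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # One listing page holds at most 1000 keys, which is also
            # the per-request limit of delete_objects.
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted


def build_distcp_command(source_uri, target_uri):
    """Return a shell-safe `hadoop distcp` command line (hypothetical helper)."""
    return "hadoop distcp {} {}".format(
        shlex.quote(source_uri), shlex.quote(target_uri)
    )
```

Passing the client in as an argument keeps the delete logic independent of how the connection is created, so it can be exercised with a stub before it is wired into a hook or operator.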
My questions are:
Airflow?
I'm using:
Airflow 1.9.0 [Python 3.6.6] (will upgrade to Airflow 1.10 once it is released)
EMR 5.13.0
Upvotes: 0
Views: 1494
Reputation: 6259
Well, the delete is a primitive operation, yes, but hadoop distcp is not. To answer your questions:
Not sure why the standard S3_Hook does not have a delete function. It MAY be because S3 provides an "eventually consistent" consistency model (probably not the reason, but good to keep in mind anyway).
Upvotes: 1