Bommu

Reputation: 237

How can I export an HDFS file stored in S3 to a local machine as a CSV file using Airflow?

I want to create an Airflow job to export an HDFS file stored in S3 to a local machine. Which Airflow operator could be used for this?

Upvotes: 1

Views: 977

Answers (1)

Nick_Kh

Reputation: 5243

There is no single Airflow operator that fully satisfies your needs; however, I see two ways to potentially address this:

  1. A basic approach: use the AWS CLI `cp` command inside an Airflow BashOperator, leveraging Bash to copy the target S3 file to the local destination. This method was already discussed in this Stack thread, though in a slightly different scenario.
  2. Besides Operators, Airflow provides a flexible mechanism called Hooks, which extends Operator functionality by implementing communication channels to external platforms. The S3_hook module offers AWS S3 related operations and is based on the boto3 library. You will probably not find a suitable download method among its contents, but the S3_to_hive_operator is worth inspecting: its source code contains an execute() function that triggers boto3's download_fileobj() method, downloading a file from an S3 bucket to the local drive. You can therefore write a custom Airflow operator with a partially modified execute() function that reuses the relevant S3_hook functionality.
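A minimal sketch of the first option, assuming the AWS CLI is installed and configured on the Airflow worker; the bucket name, key, and local path below are hypothetical placeholders:

```shell
# Copy a single object from S3 to the local filesystem via the AWS CLI.
# Bucket, key, and destination path are hypothetical examples.
aws s3 cp s3://my-bucket/exports/data.csv /tmp/data.csv
```

Inside a DAG, the same command string would be passed to a BashOperator via its `bash_command` argument.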
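For the second option, here is a minimal sketch of what the operator's execute() ultimately does, using boto3's download_fileobj(); the function name and the injectable `s3_client` parameter are my own additions for illustration, not part of the Airflow source:

```python
def download_s3_object(bucket, key, local_path, s3_client=None):
    """Download a single S3 object to a local file, mirroring the
    download_fileobj() call found in S3_to_hive_operator's execute()."""
    if s3_client is None:
        # Deferred import so the helper can also be exercised with a stub client.
        import boto3
        s3_client = boto3.client("s3")
    with open(local_path, "wb") as f:
        # Streams the object body directly into the local file handle.
        s3_client.download_fileobj(bucket, key, f)
```

In a custom operator, this logic would live inside execute(), with the bucket, key, and destination path supplied as operator arguments.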

Hope this is helpful for your research.

Upvotes: 2
