Reputation: 743
I've been trying to use Airflow to schedule a DAG. One of the DAGs includes a task that loads data from an S3 bucket.
For this I need to set up an S3 connection, but the UI provided by Airflow isn't that intuitive (http://pythonhosted.org/airflow/configuration.html?highlight=connection#connections). Has anyone succeeded in setting up the S3 connection? If so, are there any best practices you folks follow?
Thanks.
Upvotes: 64
Views: 73000
Reputation: 1539
EDIT: This answer stores your secret key in plain text which can be a security risk and is not recommended. The best way is to put access key and secret key in the login/password fields, as mentioned in other answers below. END EDIT
It's hard to find references, but after digging a bit I was able to make it work.
Create a new connection with the following attributes:
Conn Id: my_conn_S3
Conn Type: S3
Extra:
{"aws_access_key_id":"_your_aws_access_key_id_", "aws_secret_access_key": "_your_aws_secret_access_key_"}
To use this connection, below you can find a simple S3 sensor test. The idea is to set up a sensor that watches for files in S3 (the check_s3_for_file_in_s3 task) and, once a matching file appears, triggers a bash command (the bash_test task).
Start the webserver and the scheduler:
airflow webserver
airflow scheduler
The schedule_interval in the DAG definition is set to '@once' to facilitate debugging.
To run it again, leave everything as it is, remove the files in the bucket, and try again by selecting the first task (in the graph view) and selecting 'Clear' with all of 'Past', 'Future', 'Upstream' and 'Downstream' activity. This should kick off the DAG again.
Let me know how it went.
"""
S3 Sensor Connection Test
"""
from datetime import datetime, timedelta

# Airflow 1.x import style, as in the original answer.
# Later releases moved these classes to other modules.
from airflow import DAG
from airflow.operators import BashOperator, S3KeySensor

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 11, 1),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=5)
}

# '@once' runs the DAG a single time, which makes debugging easier.
dag = DAG('s3_dag_test', default_args=default_args, schedule_interval='@once')

# Runs only after the sensor below has succeeded.
t1 = BashOperator(
    task_id='bash_test',
    bash_command='echo "hello, it should work" > s3_conn_test.txt',
    dag=dag)

# Waits until a key matching the wildcard pattern exists in the bucket.
sensor = S3KeySensor(
    task_id='check_s3_for_file_in_s3',
    bucket_key='file-to-watch-*',
    wildcard_match=True,
    bucket_name='S3-Bucket-To-Watch',
    s3_conn_id='my_conn_S3',   # the Conn Id created above
    timeout=18*60*60,          # give up after 18 hours
    poke_interval=120,         # check every 120 seconds
    dag=dag)

t1.set_upstream(sensor)
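For intuition about which keys the bucket_key pattern above would pick up: with wildcard_match=True the pattern is interpreted as a Unix shell-style wildcard. The sketch below only imitates that matching with the standard-library fnmatch module. The helper name matching_keys and the sample key names are made up for illustration, and it doesn't talk to S3 or Airflow at all.

```python
from fnmatch import fnmatchcase


def matching_keys(keys, pattern="file-to-watch-*"):
    """Return the keys that match a shell-style wildcard pattern (case-sensitive)."""
    return [key for key in keys if fnmatchcase(key, pattern)]


if __name__ == "__main__":
    # Hypothetical key names, just to try the pattern out locally.
    sample = ["file-to-watch-1", "file-to-watch-2016-11-01.csv", "other-file.txt"]
    print(matching_keys(sample))
```

Trying a pattern locally like this can save a few rounds of uploading test files before the sensor fires.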
Upvotes: 106
Reputation: 4048
We added this to our docs a few versions ago:
http://airflow.apache.org/docs/stable/howto/connection/aws.html
There is no difference between an AWS connection and an S3 connection.
The accepted answer here has key and secret in the extra/JSON, and while that still works (as of 1.10.10) it is not recommended anymore as it displays the secret in plain text in the UI.
Upvotes: 10
Reputation: 1724
Conn Id: example_s3_connnection
Conn Type: S3
Extra: {"aws_access_key_id":"xxxxxxxxxx", "aws_secret_access_key": "yyyyyyyyyyy"}
Note: Login and Password fields are left empty.
Upvotes: 1
Reputation: 153
Another option that worked for me was to put the access key as the "login" and the secret key as the "password":
Conn Id: <arbitrary_conn_id>
Conn Type: S3
Login: <aws_access_key>
Password: <aws_secret_key>
Leave all other fields blank.
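If you'd rather not type the keys into the UI at all, Airflow can also read a connection from an environment variable named AIRFLOW_CONN_<CONN_ID> that holds a URI, with the access key in the user part and the secret key in the password part. A minimal sketch of building that value follows. The helper name airflow_conn_env and the placeholder keys are made up, and the aws scheme is an assumption, so check the connection docs for your Airflow version. The secret has to be URL-encoded because it can contain characters such as '/'.

```python
from urllib.parse import quote


def airflow_conn_env(conn_id, access_key, secret_key, scheme="aws"):
    """Build the (name, value) pair for an AIRFLOW_CONN_* environment variable."""
    name = "AIRFLOW_CONN_" + conn_id.upper()
    # safe="" makes quote() also encode "/", which often appears in secret keys.
    value = "{}://{}:{}@".format(
        scheme, quote(access_key, safe=""), quote(secret_key, safe="")
    )
    return name, value


if __name__ == "__main__":
    # Placeholder credentials, not real keys.
    name, value = airflow_conn_env("my_conn_S3", "AKIAEXAMPLE", "abc/def+ghi")
    print("export {}='{}'".format(name, value))
```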
Upvotes: 15
Reputation: 26
For AWS in China, this doesn't work on airflow==1.8.0; you need to update to 1.9.0. Note that the 1.9.0 release is published under the new package name, so install apache-airflow==1.9.0.
Upvotes: 0
Reputation: 77
For newer versions, change the Python code in the sample above from
s3_conn_id='my_conn_S3'
to
aws_conn_id='my_conn_S3'
Upvotes: 5
Reputation: 1784
This assumes Airflow is hosted on an EC2 server.
Just create the connection as per the other answers, but leave everything in the configuration blank apart from the connection type, which should stay as S3.
The S3Hook will default to boto, and boto will default to the role of the EC2 server you are running Airflow on. Assuming this role has rights to S3, your task will be able to access the bucket.
This is a much safer way than using and storing credentials.
Upvotes: 22
Reputation: 171
If you are worried about exposing the credentials in the UI, another way is to pass the location of a credentials file in the Extra param in the UI. Only the functional user has read privileges on the file. It looks something like this:
Extra: {
    "profile": "<profile_name>",
    "s3_config_file": "/home/<functional_user>/creds/s3_credentials",
    "s3_config_format": "aws"
}
The file "/home/<functional_user>/creds/s3_credentials" has the following entries:
[<profile_name>]
aws_access_key_id = <access_key_id>
aws_secret_access_key = <secret_key>
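A small sketch of generating both pieces, in case you want to script this setup. It writes the credentials file in the INI layout shown above, readable only by its owner, and prints the matching Extra JSON. The function names, the temporary path and the placeholder keys are all made up for illustration. Only the three Extra keys come from this answer.

```python
import configparser
import json
import os
import tempfile


def write_s3_credentials(path, profile, access_key_id, secret_key):
    """Write an AWS-style credentials file with a single profile section."""
    # interpolation=None so a "%" inside a secret key is stored literally.
    config = configparser.ConfigParser(interpolation=None)
    config[profile] = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_key,
    }
    # A newly created file gets mode 0600 (owner read/write only);
    # an already existing file keeps whatever permissions it had.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        config.write(handle)


def extra_json(path, profile):
    """Return the JSON string to paste into the connection's Extra field."""
    return json.dumps(
        {"profile": profile, "s3_config_file": path, "s3_config_format": "aws"}
    )


if __name__ == "__main__":
    # Hypothetical location and placeholder keys, for a local dry run only.
    creds_path = os.path.join(tempfile.mkdtemp(), "s3_credentials")
    write_s3_credentials(creds_path, "my_profile", "AKIAEXAMPLE", "not-a-real-secret")
    print(extra_json(creds_path, "my_profile"))
```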
Upvotes: 17