Reputation: 371
Doc says:
Hooks are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig. Hooks implement a common interface when possible, and act as a building block for operators. Ref
But why do we need them?
I want to select data from one Postgres DB and store it in another one. Can I just use, for example, the psycopg2 driver inside a Python script run by a PythonOperator, or does Airflow need to know for some reason what exactly I'm doing inside the script, so that I have to use PostgresHook instead of plain psycopg2?
Upvotes: 14
Views: 4553
Reputation: 6259
You don't 'need' them; Airflow provides them as a convenience. They also let you manage connection information in a single place.
Airflow also supplies a bunch of them out of the box so that you don't need to write code for basic operations. If you want to use the s3/postgres/mysql clients directly, you can still do that with hooks via a hook's get_conn method.
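For example, a minimal sketch of reaching the raw psycopg2 client through a hook (the connection id 'my_postgres' is hypothetical; it would be configured in Airflow):

from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id='my_postgres')  # hypothetical connection id
conn = hook.get_conn()  # the underlying psycopg2 connection
cur = conn.cursor()
cur.execute('SELECT version()')
print(cur.fetchone())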
There is no real reason not to use hooks, especially since it is extremely easy to create your own by extending the provided ones.
Upvotes: 0
Reputation: 497
While it is possible to just hardcode the connection details in your script and run it, the power of hooks is that they let you edit those connection details from within the Airflow UI instead of touching code.
Have a look at "Automate AWS Tasks Thanks to Airflow Hooks" to learn a bit more about how to use hooks.
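For instance, a minimal sketch of reading those stored details from code (the connection id 'my_postgres' is an assumption; it would be created via the Airflow UI):

from airflow.hooks.base import BaseHook

# 'my_postgres' is a hypothetical connection id created in the UI
conn = BaseHook.get_connection('my_postgres')
print(conn.host, conn.login)  # details come from Airflow's metadata DB, not the script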
Upvotes: 4
Reputation: 2352
You should just use PostgresHook. Instead of using psycopg2 directly, like so:
import psycopg2

# credentials are hardcoded in the script
conn = psycopg2.connect(host=host, user=user, password=password, dbname=dbname)
cur = conn.cursor()
cur.execute(query)
data = cur.fetchall()
You can just type:
from airflow.providers.postgres.hooks.postgres import PostgresHook

postgres = PostgresHook(postgres_conn_id='connection_id')
data = postgres.get_pandas_df(query)
This also benefits from Airflow's connection encryption, so credentials are stored encrypted rather than living in your code.
So using hooks is cleaner, safer and easier.
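Putting it together for the question's task (copying rows from one Postgres DB to another), here is a rough sketch; the connection ids and table name are hypothetical:

from airflow.providers.postgres.hooks.postgres import PostgresHook

# Both connection ids are assumed to be configured in Airflow
src = PostgresHook(postgres_conn_id='postgres_source')
dst = PostgresHook(postgres_conn_id='postgres_target')

rows = src.get_records('SELECT id, name FROM my_table')  # list of tuples
dst.insert_rows(table='my_table', rows=rows)             # bulk insert via DbApiHook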
Upvotes: 7