Reputation: 5209
I'm starting a new project which requires a data warehouse, for which we'll be using Postgres. (Shameless plug: Swarm64 makes Postgres a great DW option for multi-terabyte datasets.)
I'm using Apache Airflow to orchestrate the workloads, but as I'm new to Airflow, I'm not sure what the best practice is for the application's DB needs.
For some more context, I'm using Airflow's docker-compose.yml.
I noticed that the docker-compose file already defines a Postgres DB:
...
postgres:
  image: postgres:13
  environment:
    POSTGRES_USER: airflow
    POSTGRES_PASSWORD: airflow
    POSTGRES_DB: airflow
  volumes:
    - postgres-db-volume:/var/lib/postgresql/data
  healthcheck:
    test: ["CMD", "pg_isready", "-U", "airflow"]
    interval: 5s
    retries: 5
  restart: always
...
I'm immediately wondering whether it would be a better idea to add another postgres service, or to configure the existing one with two users and two databases...
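For illustration, the second-service option might look roughly like this (just a sketch: the service name, credentials, and host port are placeholders I made up, and dwh-db-volume would also need to be declared under the top-level volumes: key):

dwh-postgres:
  image: postgres:13
  environment:
    POSTGRES_USER: dwh
    POSTGRES_PASSWORD: dwh
    POSTGRES_DB: dwh
  volumes:
    - dwh-db-volume:/var/lib/postgresql/data
  ports:
    - "5433:5432"   # only if host access is wanted; pick any free port
  healthcheck:
    test: ["CMD", "pg_isready", "-U", "dwh"]
    interval: 5s
    retries: 5
  restart: always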
Eventually, I'll move this project to the cloud, and will probably use AWS RDS for PostgreSQL or similar.
My question then is: what's the best practice here? Should the Airflow metadata DB and the data warehouse share one Postgres service, or be kept separate?
Upvotes: 2
Views: 918
Reputation: 15911
Airflow doesn't care what your DWH is; you will be able to interact with it using Hooks and Operators. Many of them are available as Providers for Airflow, and you can always write custom ones if needed.
You need to separate the Airflow backend metadata DB (which can be PostgreSQL or MySQL) from your analytical storage, where you store your processed data; that can be anything you want (PostgreSQL, MySQL, S3, BigQuery, and many others).
Do NOT make the Airflow backend database your analytical database as well, even if both are PostgreSQL!
As for your question, the answer is:
Use a plain, regular PostgreSQL/MySQL database for your Airflow installation, and keep the warehouse separate.
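If you do run a separate warehouse DB next to the compose stack, Airflow can pick up the connection from an environment variable. A minimal sketch, assuming a service named dwh-postgres as in your question and the shared x-airflow-common environment block from the official docker-compose.yml (the connection id dwh_postgres is a placeholder):

# Added to the environment block shared by the airflow services;
# Airflow reads AIRFLOW_CONN_* variables as connections, so this
# becomes a connection with id "dwh_postgres".
AIRFLOW_CONN_DWH_POSTGRES: postgres://dwh:dwh@dwh-postgres:5432/dwh

Hooks and Operators from the Postgres provider can then reference it via postgres_conn_id="dwh_postgres".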
Upvotes: 2