Reputation: 2739
I could not find in the Airflow docs how to set up the retention policy I need. Currently, all Airflow logs have to be deleted manually, otherwise they are kept forever on our servers. Not the best way to go.
I wish to create global logs configurations for all the different logs I have.
How and where do I configure this?
Upvotes: 4
Views: 7335
Reputation: 1847
There's no configuration in airflow.cfg that gives you a retention policy the way Apache Kafka has one. Until Airflow adds a config for log retention, I try not to make things in Airflow complicated to maintain, so I pick this easy route; the systems team or the data analysts on the team can then maintain it without needing senior developer experience.
Usually you delete logs or obsolete DAG runs to save space or to make Airflow load DAGs faster, and while you make these unorthodox changes you need to make sure Airflow's integrity is not harmed (learned the hard way).
Logs in Airflow can live in 3 places: the backend DB, the log folder (DAG logs, scheduler logs, etc.), and a remote location (not needed 99% of the time).
Start by deleting old DAG runs in the backend database. Mine is Postgres, and the SQL below has one purpose: keep the latest 10 runs per DAG and delete the rest.
Step 1: Delete data in the backend database (to make Airflow load faster)
WITH RankedDags AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY dag_id ORDER BY execution_date DESC) AS rn
    FROM public.dag_run
    WHERE (state = 'success' OR state = 'failed')
)
DELETE FROM public.dag_run
WHERE id IN (
    SELECT id
    FROM RankedDags
    WHERE rn > 10
);
You can also pick a cutoff date and, instead of the above, use the result of a select query like the one below to delete only the old runs. I usually don't do this because I have DAGs that run once a year or once a month, and I want to know how those looked in their first run:
SELECT *
FROM public.dag_run f
WHERE (f.state = 'success' OR f.state = 'failed')
AND DATE(f.execution_date) <= CURRENT_DATE - INTERVAL '15 days';
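If you do go the date-based route, the matching delete is just the same filter wrapped in a DELETE (15 days being the example cutoff from the select above):

-- Sketch: delete finished runs older than 15 days, same filter as the SELECT above
DELETE FROM public.dag_run
WHERE (state = 'success' OR state = 'failed')
  AND DATE(execution_date) <= CURRENT_DATE - INTERVAL '15 days';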
Step 2: Remove the scheduler logs (these are the logs that waste the most space, and don't worry, nothing bad will happen). Just don't delete the directory that the 'latest' symlink points to.
root@airflow-server:~/airflow/logs/scheduler# ll
drwxr-xr-x 3 root root 4096 Sep 24 20:00 2024-09-25
drwxr-xr-x 3 root root 4096 Sep 25 20:00 2024-09-26
drwxr-xr-x 3 root root 4096 Sep 26 20:00 2024-09-27
drwxr-xr-x 3 root root 4096 Sep 30 10:57 2024-09-30
drwxr-xr-x 7 root root 4096 Oct 31 20:00 2024-11-01
lrwxrwxrwx 1 root root 10 Oct 31 20:00 latest -> 2024-11-01
# rm -rf 2024-09-*
Now you have at least 80% of your logs deleted and should be satisfied, but if you want to go further you can write a bash script that traverses /root/airflow/logs/dag_id* and removes folders or files with an old modification date (a rough sketch follows below). Even if you go past Steps 1 and 2 and delete the directories mentioned, you only lose the detailed logs of each task instance.
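A rough sketch of such a script, assuming the default layout under /root/airflow/logs and a 30-day cutoff (the path, the age, and the choice to skip the scheduler folder are all adjustable assumptions):

#!/usr/bin/env bash
# Sketch: delete per-DAG task log files older than MAX_AGE_DAYS, then drop empty directories.
LOG_DIR=/root/airflow/logs      # adjust to your AIRFLOW_HOME/logs
MAX_AGE_DAYS=30

# Remove old log files under the dag_id* folders (scheduler logs are handled in Step 2)
find "$LOG_DIR" -type f -mtime +"$MAX_AGE_DAYS" -not -path "$LOG_DIR/scheduler/*" -delete

# Remove directories left empty by the deletion
find "$LOG_DIR" -mindepth 1 -type d -empty -not -path "$LOG_DIR/scheduler*" -delete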
You can also take measures like raising the log level to 'ERROR' in airflow.cfg to lighten the app.
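For reference, in Airflow 2.x that option is logging_level in the [logging] section of airflow.cfg (as far as I remember, older 1.10 releases kept it under [core]):

[logging]
# Only write ERROR and above; the default is INFO
logging_level = ERROR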
You can always turn the steps above into a job that runs automatically, but since disk is cheap and a 30 GB disk can easily store more than 10,000 complex dag_runs with heavy Spark logs, you really just need to spend 30 minutes every other month cleaning the scheduler logs.
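If you ever do want to automate it, a single crontab entry pointing at a script like the sketch above is enough; the path /root/airflow/cleanup_logs.sh is just an illustrative name:

# Run the log cleanup at 03:00 on the 1st of every month
0 3 1 * * /bin/bash /root/airflow/cleanup_logs.sh >> /var/log/airflow_log_cleanup.log 2>&1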
Upvotes: 0
Reputation: 21
Starting from apache-airflow 2.6.0, you can set the value below in airflow.cfg, in the logging section.
delete_local_logs = True
For this to work, you should enable remote logging, which pushes the log files to a remote S3 bucket or something similar. Airflow automatically pushes the logs to the configured remote folder and deletes the local files. You can set a BucketLifecycleConfiguration on the S3 bucket based on your log retention period to delete old log files in the bucket automatically.
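A sketch of the relevant [logging] block, assuming S3 and an existing AWS connection; the bucket name and connection id below are placeholders:

[logging]
remote_logging = True
remote_base_log_folder = s3://my-airflow-logs/logs   # placeholder bucket/prefix
remote_log_conn_id = aws_default                     # an existing Airflow connection with S3 access
delete_local_logs = True                             # 2.6.0+: remove the local copy after upload

The lifecycle rule itself is configured on the AWS side, e.g. from the S3 console or with aws s3api put-bucket-lifecycle-configuration.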
Go through the logging documentation in the Airflow docs for more information.
Upvotes: 0
Reputation: 4863
I ran into the same situation yesterday. The solution for me was to use a DAG that handles all the log cleanup and to schedule it like any other DAG.
Check this repo; you will find a step-by-step guide on how to set it up. Basically, what you will achieve is deleting files located under airflow-home/log/ and airflow-home/log/scheduler, based on a period defined in a Variable. The DAG dynamically creates one task for each directory targeted for deletion, based on your definition.
In my case, the only modification I made to the original DAG was to allow deletion only in the scheduler folder, by replacing the initial value of DIRECTORIES_TO_DELETE. All credit to the creators! It works very well out of the box and is easy to customize.
Upvotes: 2