Reputation: 33
I'm working on an Airflow project where I have a KubernetesPodOperator task that sometimes fails but remains in a Running state even after exhausting the defined number of retries. The status in the UI shows Running, but no worker pod is actually up for it. As a result, the task can appear to run for an extended period, sometimes up to a day.
Here's a simplified version of my DAG:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id='dbt',
    dagrun_timeout=timedelta(hours=4),
    start_date=datetime(2023, 1, 1),
    schedule_interval="25 * * * *",
    catchup=True,
    max_active_runs=1,
) as dag:
    dbt_test_model = KubernetesPodOperator(
        task_id="dbt_test",
        name="dbt-run-test",
        cmds=["sh", "-c"],
        arguments=["dbt source model_test"],
        get_logs=True,
        retries=3,
        in_cluster=True,
        is_delete_operator_pod=False,
        pod_template_file=DBT_POD_TEMPLATE_PATH,
        dag=dag,
    )
    dbt_run_model = KubernetesPodOperator(
        task_id="dbt_run",
        name="dbt-run",
        cmds=["sh", "-c"],
        arguments=["dbt run"],
        get_logs=True,
        retries=3,
        in_cluster=True,
        is_delete_operator_pod=False,
        pod_template_file=DBT_POD_TEMPLATE_PATH,
        dag=dag,
    )
I've tried to debug the issue but haven't been able to find a solution. Has anyone encountered a similar issue, or does anyone have suggestions on how to handle this?
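One direction I've been considering (not yet verified to fix the hang) is capping each task with `execution_timeout`, a standard BaseOperator argument, so a stuck task is failed instead of running for a day. The 30-minute value below is my own assumption; this sketch only checks that the worst-case retry schedule stays inside the 4-hour `dagrun_timeout` from the DAG above:

```python
from datetime import timedelta

# Timeouts taken from the DAG above; the per-task cap is an assumed value.
DAGRUN_TIMEOUT = timedelta(hours=4)
TASK_EXECUTION_TIMEOUT = timedelta(minutes=30)

# These kwargs would be passed to each KubernetesPodOperator in the DAG.
# execution_timeout is a standard Airflow BaseOperator argument.
common_task_kwargs = {
    "retries": 3,
    "execution_timeout": TASK_EXECUTION_TIMEOUT,
}

# Sanity check: even if every retry hits the cap (initial try + 3 retries),
# the task cannot outlive the DAG run's own timeout.
worst_case = TASK_EXECUTION_TIMEOUT * (common_task_kwargs["retries"] + 1)
assert worst_case < DAGRUN_TIMEOUT
```

With these numbers the worst case is 2 hours of pod time, leaving headroom under the DAG-run timeout; I'd still need to confirm that Airflow actually terminates the pod when the timeout fires, given `is_delete_operator_pod=False`.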
Upvotes: 1
Views: 34