Reputation: 41
We are using Airflow 2.0.1 with the following settings:
First, we set up our own job that renewed the Kerberos tickets of the run_as_user users. It worked fine for about a week, then one of the workers started to fail with a missing Kerberos ticket. We could not find any changes between the last successful run and the failing jobs, and there was a valid ticket on the node, so we stopped the worker. The next day we restarted Airflow completely, and all the workers reported a missing Kerberos ticket. As a temporary workaround, we are able to run the jobs on one worker when kinit is run inside the DAGs, and we are going to enable Kerberos according to https://airflow.apache.org/docs/apache-airflow/2.0.1/security/kerberos.html?highlight=kerberos.
The questions would be
Upvotes: 3
Views: 3588
Reputation: 41
Thanks for the comments. I did some tests and I think I now understand a bit better what happened to us:
Access to HDFS from a BashOperator started to fail with a Kerberos error, and I thought I had to set up the configuration according to https://airflow.apache.org/docs/apache-airflow/2.0.1/security/kerberos.html?highlight=kerberos. When I tested it, the worker, running as the run_as_user user, tried to read the ticket cache file that is configured in airflow.cfg and created by airflow kerberos, and it didn't have permission to read it...
Therefore, I checked where the tickets are stored in production: they are in the /tmp of the service that runs the Airflow worker. By that I mean the tickets created by kinit run in BashOperators in a DAG. Our script that renewed the tickets, however, connected via ssh to all the nodes where Airflow workers run, so its tickets were stored in the regular server /tmp.
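The difference between the two /tmp locations can be checked by running the same small script once from a BashOperator and once from an ssh session. This is only a minimal sketch: it assumes MIT Kerberos with file-based caches and no default_ccache_name override in krb5.conf (in which case the default cache is /tmp/krb5cc_<uid>), and the function name `ccache_file` is made up for illustration.

```python
import os


def ccache_file(env=None, uid=None):
    """Return the ticket cache file a process would most likely use, or None.

    Assumes MIT Kerberos defaults: KRB5CCNAME wins if it is set, otherwise
    the cache is /tmp/krb5cc_<uid>. Non-file caches (KEYRING:, DIR:, ...)
    are not resolved here and yield None.
    """
    env = os.environ if env is None else env
    uid = os.getuid() if uid is None else uid
    name = env.get("KRB5CCNAME", "")
    if not name:
        return f"/tmp/krb5cc_{uid}"
    if name.startswith("FILE:"):
        return name[len("FILE:"):]
    if name.startswith("/"):
        return name  # a bare path is treated as a FILE: cache
    return None


if __name__ == "__main__":
    path = ccache_file()
    print("ticket cache:", path)
    print("visible from this process:", bool(path) and os.path.exists(path))
```

If the worker service has a private /tmp (for example through systemd's PrivateTmp=true, which is an assumption about our setup, not something I verified), both runs can print the same path while referring to different files, so a ticket renewed over ssh is not the one the worker reads.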
The failures started some days after we began running Airflow as services. We still have a lot of kinit calls directly in our DAGs (that was the way we did it before, because we used queues then). When we launched workers on several nodes without queues, we also set up the script for renewing the tickets (via ssh...) to be sure the tickets are renewed on all nodes. I think the Kerberos issues occurred just after a restart of the services, because the tickets created by the kinit calls in the DAGs were deleted. If so, the Kerberos issues would also have occurred when we launched the services for the first time, but we had other issues as well, so maybe we didn't notice.
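One way to stop depending on which /tmp is in use might be to have every kinit name its cache explicitly. The sketch below only builds the command string. The keytab path, principal, and cache path are placeholders rather than our real values, and `kinit_command` is a hypothetical helper. It relies on the standard kinit flags -k and -t (authenticate from a keytab) and -c (cache name).

```python
import shlex


def kinit_command(keytab, principal, ccache=None):
    """Build a kinit command line that authenticates from a keytab.

    If ccache is given, the ticket is written to that cache file instead
    of the default one, so every caller ends up using the same path.
    """
    parts = ["kinit", "-k", "-t", keytab]
    if ccache:
        parts += ["-c", ccache]
    parts.append(principal)
    # Quote each argument so the string is safe to pass to a shell.
    return " ".join(shlex.quote(p) for p in parts)


# Placeholder values for illustration only.
print(kinit_command("/path/to/etl.keytab", "etl@EXAMPLE.COM",
                    ccache="/path/to/shared/krb5cc_etl"))
```

The resulting string could serve as the bash_command of a BashOperator or as the command in the ssh renewal script. The task that later reads the ticket would then need KRB5CCNAME pointing at the same file, and that file has to be readable by the run_as_user user.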
Upvotes: 1