Reputation: 57
I need to use Pandas in an Airflow job. Although I am an experienced programmer, I am relatively new to Python. In my requirements.txt, should I install pandas from PyPI or apache-airflow[pandas]?
Also, I am not entirely sure what the provider apache-airflow[pandas] does. And how does pip resolve it? (It does not seem to be on PyPI.)
Thank you in advance for the answers.
Upvotes: 1
Views: 1830
Reputation: 15961
I suggest installing Airflow with constraints, as explained in the docs:
pip install "apache-airflow[pandas]==2.5.1" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.5.1/constraints-3.7.txt"
This ensures a stable installation of Airflow without dependency conflicts. Airflow also updates the constraints when a release is cut, so when you upgrade Airflow you get the latest possible version of each library that "agrees" with all other Airflow dependencies.
For example:
With Airflow 2.5.1 and Python 3.7, the constrained version is:
pandas==1.3.5
With Airflow 2.5.1 and Python 3.9, the constrained version is:
pandas==1.5.2
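To make the pattern behind that command explicit, here is a minimal sketch that builds the constraints URL and the matching pip arguments from an Airflow version and a Python minor version. The helper names `constraints_url` and `pip_install_command` are made up for illustration; only the URL layout comes from the command above.

```python
def constraints_url(airflow_version: str, python_version: str) -> str:
    """Build the constraints file URL for an Airflow release and Python minor version."""
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


def pip_install_command(airflow_version, python_version, extras=("pandas",)):
    """Return the pip arguments for a constrained Airflow install (hypothetical helper)."""
    extras_part = f"[{','.join(extras)}]" if extras else ""
    return [
        "pip",
        "install",
        f"apache-airflow{extras_part}=={airflow_version}",
        "--constraint",
        constraints_url(airflow_version, python_version),
    ]


if __name__ == "__main__":
    # Prints the same command as above, assembled from its two inputs.
    print(" ".join(pip_install_command("2.5.1", "3.7")))
```

Changing the second argument to "3.9" points pip at the Python 3.9 constraints file, which is why the resolved pandas version differs between the two examples.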
Personally, I don't recommend overriding the versions in the constraints. Doing so carries a risk that your production environment will not be stable or consistent (unless you implement your own mechanism to generate constraints). If a specific task requires a different version of a library (pandas or any other), I suggest using PythonVirtualenvOperator, DockerOperator, or any other alternative that lets you set specific library versions for that task. This also gives DAG authors the freedom to use whatever library version they need without depending on other teams that share the same Airflow instance and need other versions of the same library, or even on the same team's other projects that need different versions (think of it the same way you manage virtual environments in your IDE).
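As a rough sketch of the per-task approach, the snippet below wires a callable into PythonVirtualenvOperator with its own pandas pin. The task id, the function names, and the pinned version are placeholders I picked for illustration, and the Airflow import is done inside `make_task` only so the module can be read without Airflow installed; in a real DAG file you would normally import it at the top and create the task inside your DAG.

```python
# Placeholder pin for this one task; it does not affect the shared Airflow environment.
PINNED_REQUIREMENTS = ["pandas==1.5.2"]


def report_pandas_version():
    """Runs inside the task's virtualenv, so imports must live in the function body."""
    import pandas as pd

    print(f"pandas version inside the task: {pd.__version__}")
    return pd.__version__


def make_task():
    """Create the operator (hypothetical helper; call it inside a `with DAG(...)` block)."""
    from airflow.operators.python import PythonVirtualenvOperator

    return PythonVirtualenvOperator(
        task_id="report_pandas_version",
        python_callable=report_pandas_version,
        requirements=PINNED_REQUIREMENTS,
        # Do not inherit packages from the Airflow environment.
        system_site_packages=False,
    )
```

The virtualenv is created when the task runs, so the trade-off is some startup time per run in exchange for isolation from the constrained environment.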
As for your question about apache-airflow[pandas]: note that this is an extra dependency, not an Airflow provider as you mentioned. It exists because Airflow used to depend on pandas as part of Airflow core. However, pandas is a heavy library and not everyone needs it, so moving it to an optional dependency makes sense. That way, only users who need pandas in their Airflow environment will install it.
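On how pip resolves it: an extra is not a separate package on PyPI. It is declared in the metadata of apache-airflow itself as requirements guarded by an `extra == "..."` marker, and pip adds those requirements when you ask for the extra in brackets. A minimal sketch for inspecting this with the standard library follows; the helper names and the sample requirement strings are illustrative, not copied from a real Airflow release.

```python
from importlib import metadata


def requirements_for_extra(requirement_lines, extra):
    """Return the requirement strings that are only active for the given extra."""
    # Metadata may quote the extra name with either double or single quotes.
    markers = (f'extra == "{extra}"', f"extra == '{extra}'")
    selected = []
    for line in requirement_lines:
        requirement, _, marker = line.partition(";")
        if any(m in marker for m in markers):
            selected.append(requirement.strip())
    return selected


def installed_extra_requirements(distribution, extra):
    """Look up an installed distribution's declared requirements for one extra."""
    try:
        declared = metadata.requires(distribution) or []
    except metadata.PackageNotFoundError:
        return []
    return requirements_for_extra(declared, extra)


if __name__ == "__main__":
    # Illustrative metadata lines, shaped like what importlib.metadata returns.
    sample = [
        "alembic>=1.5.1",
        'pandas>=0.17.1; extra == "pandas"',
    ]
    print(requirements_for_extra(sample, "pandas"))
    # Against a real environment (empty list if Airflow is not installed):
    print(installed_extra_requirements("apache-airflow", "pandas"))
```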
Upvotes: 0
Reputation: 3074
apache-airflow[pandas] only installs pandas>=0.17.1: https://github.com/apache/airflow/blob/0d2555b318d0eb4ed5f2d410eccf20e26ad004ad/setup.py#L308-L310. For context, this was the PR that originally added it: https://github.com/apache/airflow/pull/17575.
Since >=0.17.1 is quite broad, I suggest pinning Pandas to a more specific version in your requirements.txt. This gives you more control over which Pandas version gets installed, instead of leaving it to the wide range of versions that Airflow's requirement allows.
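To illustrate how wide that range is, here is a simplified check of a few pandas release numbers against the 0.17.1 lower bound. It only handles plain numeric versions such as "1.5.2" and is not how pip compares versions (pip follows PEP 440); the candidate list and helper names are just examples.

```python
def release_tuple(version: str):
    """Turn a plain numeric version like '1.5.2' into a comparable tuple (1, 5, 2)."""
    return tuple(int(part) for part in version.split("."))


def satisfies_minimum(version: str, minimum: str = "0.17.1") -> bool:
    """True if the version is at or above the lower bound (simplified comparison)."""
    return release_tuple(version) >= release_tuple(minimum)


# Example release numbers spanning several major versions.
candidates = ["0.17.1", "1.3.5", "1.5.2", "2.0.0"]
accepted = [v for v in candidates if satisfies_minimum(v)]

if __name__ == "__main__":
    print(accepted)
```

Every candidate passes, which is the point: without a tighter pin or a constraints file, pip is free to pick any of them.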
Upvotes: 3