aviral sanjay

Reputation: 983

Getting "No module named pandas" error in Airflow even after pandas was successfully installed

How do I resolve the error "No module named pandas" when one node in an Airflow DAG uses pandas successfully and another does not?

I cannot work out why I am getting the error "No module named pandas".

I have checked with pip3 freeze, and the desired pandas version does show up.

I have deployed this using Docker on a Kubernetes cluster.

Upvotes: 3

Views: 2414

Answers (2)

devatherock

Reputation: 4961

In my case, I was running Airflow with Docker Compose, using a custom Docker image that installed additional packages with pip install.

Dockerfile:

FROM apache/airflow:slim-2.7.2-python3.11
# Copy the dependency list into the image, then install it.
COPY airflow/requirements.txt .
RUN pip install -r requirements.txt

docker-compose.yml:

services:
  airflow:
    build: .
    environment:
      AIRFLOW__CORE__EXECUTOR: SequentialExecutor
      AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
      AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
      AIRFLOW__CORE__DAGS_FOLDER: /data/dags
      _AIRFLOW_DB_MIGRATE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: airflow
      _AIRFLOW_WWW_USER_PASSWORD: airflow
      AIRFLOW__WEBSERVER__EXPOSE_CONFIG: 'true'
      AIRFLOW__SECRETS__BACKEND: airflow.secrets.local_filesystem.LocalFilesystemBackend
      AIRFLOW__SECRETS__BACKEND_KWARGS: '{"variables_file_path": "/data/variables.yml", "connections_file_path": "/data/connections.yml"}'
    ports:
      - '8080:8080'
      - '8793:8793'
      - '8794:8794'
    volumes:
      - ./airflow:/data
    command: 'standalone'

I was originally getting the ModuleNotFoundError: No module named 'pandas' error at DAG import time. That error went away after I added the line pandas==2.1.1 to the requirements.txt file. I then started getting the same error on the same DAG, but only when the DAG was executed. At first glance it looked like the same error as before, but on closer inspection the underlying exception pointed to a different package:

ModuleNotFoundError: No module named 'pandas'
During handling of the above exception, another exception occurred:
...
Exception: pandas library not installed, run: pip install 'apache-airflow-providers-common-sql[pandas]'.

The runtime error went away after I added the line apache-airflow-providers-common-sql[pandas]==1.7.2 to requirements.txt. I also had to remove the pandas==2.1.1 line: with both apache-airflow-providers-common-sql[pandas] and pandas specified, there appeared to be a version mismatch, and the ModuleNotFoundError: No module named 'pandas' error still persisted at runtime.

Final requirements.txt:

apache-airflow-providers-postgres==5.6.1
apache-airflow-providers-common-sql[pandas]==1.7.2
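
To check which versions actually ended up in the image after a change like this, something along these lines can be run with the container's Python. It is only a sketch using the standard library's importlib.metadata; the helper name installed_versions is hypothetical, and the distribution names are the ones from the requirements above plus pandas.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in distributions:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed for this interpreter.
            result[name] = None
    return result


if __name__ == "__main__":
    names = [
        "pandas",
        "apache-airflow-providers-common-sql",
        "apache-airflow-providers-postgres",
    ]
    for name, found in installed_versions(names).items():
        print(f"{name}: {found or 'not installed'}")
```

A None entry for pandas would suggest that the interpreter running the check does not have it installed, regardless of what the requirements file says.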

Upvotes: 0

dlamblin

Reputation: 45341

Pandas is generally required by Airflow, and some hooks use it to return dataframes. It's possible that Airflow was installed with pip rather than pip3, and so was added as a Python 2 module instead of a Python 3 module (though installing with pip should still have installed pandas, judging by the setup.py).

Which operator in your DAG is giving this error? Do you have any PythonVirtualenvOperators or BashOperators running Python from the command line (and thus possibly not sharing the same environment in which you checked that pandas is installed)?
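
One way to answer the environment question is to have the failing task report which interpreter it is running under and whether that interpreter can see pandas. The following is a minimal sketch, assuming the task can run arbitrary Python (for example as the callable of a PythonOperator, or as a script invoked by a BashOperator); the function name describe_environment is hypothetical.

```python
import importlib.util
import sys


def describe_environment(module_name="pandas"):
    """Report the running interpreter and whether module_name is importable."""
    # find_spec returns None for a top-level module that cannot be found.
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "python_version": sys.version.split()[0],
        "module_found": spec is not None,
        "module_location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the executable printed in the task log differs from the interpreter behind the pip3 that was used for pip3 freeze, the two checks are looking at different environments.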

Upvotes: 1
