Google Dataflow: Import custom Python module

Question

I am trying to run an Apache Beam pipeline (Python) on Google Cloud Dataflow, triggered by a DAG in Google Cloud Composer.

The structure of my dags folder in the respective GCS bucket is as follows:

/dags/
  dataflow.py <- DAG
  dataflow/
    pipeline.py <- pipeline
    setup.py
    my_modules/
      __init__.py
      commons.py <- the module I want to import in the pipeline

The setup.py is very basic and follows the Apache Beam docs and answers on SO:

import setuptools

setuptools.setup(setuptools.find_packages())
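For comparison, here is a minimal sketch of a setup.py that passes the discovered packages by keyword. setuptools.setup() accepts keyword arguments only, so the positional call above would be expected to raise a TypeError when the file is executed. The name and version values below are hypothetical placeholders, not taken from the question.

```python
import setuptools

setuptools.setup(
    # Hypothetical distribution name and version; pick your own.
    name="my_dataflow_pipeline",
    version="0.0.1",
    # find_packages() searches the directory containing setup.py and
    # returns every directory with an __init__.py, here: my_modules.
    packages=setuptools.find_packages(),
)
```

This assumes setup.py sits next to the my_modules/ directory, as in the folder layout shown earlier.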

In the DAG file (dataflow.py) I set the setup_file option and pass it to Dataflow:

default_dag_args = {
    ... ,
    'dataflow_default_options': {
        ... ,
        'runner': 'DataflowRunner',
        'setup_file': os.path.join(configuration.get('core', 'dags_folder'), 'dataflow', 'setup.py')
    }
}
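As a sketch of how the setup_file path could be built and checked before it is handed to Dataflow, the helper below joins the path the same way and fails early if the file is missing. The function name build_dataflow_options and the dags_folder argument are hypothetical; in the DAG above the folder would come from configuration.get('core', 'dags_folder').

```python
import os


def build_dataflow_options(dags_folder):
    """Return Dataflow options pointing at dataflow/setup.py under dags_folder."""
    setup_file = os.path.join(dags_folder, "dataflow", "setup.py")
    if not os.path.isfile(setup_file):
        # Fail while the DAG is parsed rather than inside the Dataflow job.
        raise FileNotFoundError("setup.py not found at " + setup_file)
    return {"runner": "DataflowRunner", "setup_file": setup_file}
```

The returned dictionary could then be merged into 'dataflow_default_options' alongside the other (elided) options.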

Within the pipeline file (pipeline.py) I try to use

from my_modules import commons

but this fails. The log in Google Cloud Composer (Apache Airflow) says:

gcp_dataflow_hook.py:132} WARNING - b'  File "/home/airflow/gcs/dags/dataflow/dataflow.py", line 11
    from my_modules import commons
           ^
SyntaxError: invalid syntax'

The basic idea behind the setup.py file is documented here

Also, there are similar questions on SO which helped me:

Google Dataflow - Failed to import custom python modules

Dataflow/apache beam: manage custom module dependencies

I'm actually wondering why my pipeline fails with a SyntaxError and not a module-not-found kind of error.
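One way to see the difference between the two error types: Python parses the entire file before it runs any of it, so a SyntaxError is raised before a single import is attempted, while a missing module only surfaces when the import statement executes. The sketch below illustrates this with a hypothetical helper, classify, and a module name (no_such_module_xyz) that is assumed not to be installed.

```python
def classify(source):
    """Return the name of the first failure when compiling, then running, source."""
    try:
        # Parsing only: no import is resolved at this stage.
        code = compile(source, "<example>", "exec")
    except SyntaxError:
        return "SyntaxError"
    try:
        exec(code, {})
    except ModuleNotFoundError:
        return "ModuleNotFoundError"
    return "ok"


# Valid syntax, missing module: fails only when the import runs.
print(classify("from no_such_module_xyz import commons"))  # ModuleNotFoundError

# An unclosed bracket on the previous line makes the import line unparseable.
print(classify("x = (1,\nfrom no_such_module_xyz import commons"))  # SyntaxError
```

Since the import statement in the question is valid on its own, a SyntaxError reported at that line may point to something earlier in the file, such as an unclosed bracket, or to the file being parsed in a way that was not intended. This is a guess from the log alone.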
