Reputation: 83
I am trying to test my Dataflow pipeline on the DataflowRunner. The job always gets stuck at 1 hr 1 min and reports: The Dataflow appears to be stuck. Digging through the stack trace in Stackdriver, I found this error: Failed to install packages: failed to install workflow: exit status 1. I saw other Stack Overflow posts saying that this can happen when pip packages are not compatible. This is causing my worker startup to always fail.
This is my current setup.py. Can someone please help me understand what I am missing? The job id is 2018-02-09_08_22_34-6196858167817670597.
setup.py
from setuptools import setup, find_packages
requires = [
'numpy==1.14.0',
'google-cloud-storage==1.7.0',
'pandas==0.22.0',
'sqlalchemy-vertica[pyodbc,turbodbc,vertica-python]==0.2.5',
'sqlalchemy==1.2.2',
'apache_beam[gcp]==2.2.0',
'google-cloud-dataflow==2.2.0'
]
setup(
name="dataflow_pipeline_dependencies",
version="1.0.0",
description="Beam pipeline for flattening ism data",
packages=find_packages(),
install_requires=requires
)
Upvotes: 4
Views: 4296
Reputation: 680
Your mileage may vary, but for me, none of the above worked (Python 3.7).
Instead, the solution seemed to be to have my dependencies in a requirements.txt file and then everything else in setup.py. It was important that I not load requirements.txt lines into the install_requires property. Any way I did it, including workflow or not, having install_requires seemed to lead me to this error.
Instead, my setup.py simply does not specify dependencies at all. I gave both the --requirements_file and --setup_file arguments when running the pipeline. That solved the issue for me, and there was a noticeable difference in how the pipeline built and launched, as the dependencies were stored in the staging location this way, whereas before they were not.
For example:
setup.py
import setuptools
setuptools.setup(
name='my_pipeline',
version='0.0.0',
packages=setuptools.find_packages()
)
requirements.txt
google-cloud-bigquery==1.24.0
google-cloud-storage==1.25.0
jinja2==2.11.1
[...etc...]
run_pipeline.sh
#!/usr/bin/env bash
[...code to set vars...]
if [ "${1}" = "dataflow" ]; then
RUNNER="--runner DataflowRunner"
fi
python "${PIPELINE_FILE}" \
--output "${OUTPUT}" \
--project myproject \
--region us-west1 \
--temp_location "${TEMP}" \
--staging_location "${STAGING}" \
--no_use_public_ips \
--requirements_file requirements.txt \
--setup_file "./setup.py" \
${RUNNER}
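If you prefer to launch from Python rather than a shell script, the same flags can be assembled as a list. This is only a sketch: the function name build_pipeline_args and the gs://example-bucket paths are made up for illustration, and the flags are copied from the script above. The resulting list is meant to be handed to Beam's PipelineOptions, which is not imported here so the snippet runs without Beam installed.

```python
def build_pipeline_args(output, temp_location, staging_location, use_dataflow=False):
    """Assemble the flags that run_pipeline.sh passes on the command line.

    Hypothetical helper: the name and signature are illustrative only.
    """
    args = [
        "--output", output,
        "--project", "myproject",
        "--region", "us-west1",
        "--temp_location", temp_location,
        "--staging_location", staging_location,
        "--no_use_public_ips",
        # Dependencies live in requirements.txt; setup.py only packages the code.
        "--requirements_file", "requirements.txt",
        "--setup_file", "./setup.py",
    ]
    if use_dataflow:
        # Mirrors the RUNNER variable in the shell script.
        args += ["--runner", "DataflowRunner"]
    return args


# Placeholder bucket paths, not taken from the original post.
example_args = build_pipeline_args(
    "gs://example-bucket/output",
    "gs://example-bucket/temp",
    "gs://example-bucket/staging",
    use_dataflow=True,
)
print(" ".join(example_args))
```

With Beam installed, the list would typically be passed as PipelineOptions(example_args) when constructing the pipeline.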
Upvotes: 0
Reputation: 488
Include the 'workflow' package in the setup.py required packages. The error was resolved after including it.
from setuptools import setup, find_packages
requires = [
'numpy==1.14.0',
'google-cloud-storage==1.7.0',
'pandas==0.22.0',
'sqlalchemy-vertica[pyodbc,turbodbc,vertica-python]==0.2.5',
'sqlalchemy==1.2.2',
'apache_beam[gcp]==2.2.0',
'google-cloud-dataflow==2.2.0',
'workflow' # Include this line
]
setup(
name="dataflow_pipeline_dependencies",
version="1.0.0",
description="Beam pipeline for flattening ism data",
packages=find_packages(),
install_requires=requires
)
Upvotes: 3
Reputation: 83
So I have figured out that workflow is not a PyPI package in this case, but actually the name of the .tar that Dataflow creates to hold the source code. Dataflow compresses your source code into a workflow.tar file in your staging environment, then tries to run pip install workflow.tar. If any issue comes up during this install, it fails to install the packages onto the workers.
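One way to surface the underlying pip error without waiting an hour for the job to time out is to mimic that step locally: build a source tarball from setup.py and pip install it, ideally inside a fresh virtualenv. The sketch below is an approximation and not the exact commands Dataflow runs. The function names and the dist_check directory are hypothetical, and it assumes setup.py sits in the project root and still supports the sdist command.

```python
import subprocess
import sys
from pathlib import Path


def sdist_command(dist_dir):
    """Command that builds a source tarball from setup.py into dist_dir."""
    return [sys.executable, "setup.py", "sdist", "--dist-dir", str(dist_dir)]


def install_command(tarball):
    """Command that pip-installs the built tarball, as the worker would."""
    return [sys.executable, "-m", "pip", "install", str(tarball)]


def reproduce_worker_install(project_dir, dist_dir="dist_check"):
    """Build the package and install it locally to expose install errors.

    Raises subprocess.CalledProcessError if either step fails, which is
    the local counterpart of "failed to install workflow: exit status 1".
    """
    project_dir = Path(project_dir).resolve()
    subprocess.run(sdist_command(dist_dir), cwd=project_dir, check=True)
    tarballs = sorted((project_dir / dist_dir).glob("*.tar.gz"))
    if not tarballs:
        raise FileNotFoundError("sdist produced no .tar.gz in " + str(dist_dir))
    subprocess.run(install_command(tarballs[-1]), check=True)
    return tarballs[-1]
```

Calling reproduce_worker_install(".") from the project root should print pip's full output, including any version conflict among the pinned packages.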
My issue was resolved by two things: 1) I added six==1.10.0 to my requires, as I found from "Workflow failed. Causes: (35af2d4d3e5569e4): The Dataflow appears to be stuck" that there is an issue with the latest version of six. 2) I realized that sqlalchemy-vertica and sqlalchemy are out of sync and have conflicting dependency versions. I therefore removed my need for both and found a different Vertica client.
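Put together, those two changes might leave the requires list from the question looking roughly like this. It is a reconstruction from the description above, not the actual final file, and the replacement Vertica client is left out because the post does not name it.

```python
# Sketch of the adjusted dependency list, reconstructed from the post.
requires = [
    'numpy==1.14.0',
    'google-cloud-storage==1.7.0',
    'pandas==0.22.0',
    'six==1.10.0',  # pinned to avoid the issue with the latest six
    'apache_beam[gcp]==2.2.0',
    'google-cloud-dataflow==2.2.0',
    # sqlalchemy-vertica and sqlalchemy removed; the replacement
    # Vertica client would be added here (not named in the post).
]
```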
Upvotes: 2
Reputation: 1261
I am no genius with dealing with a lot of Python packages and how to manage all the versions, incompatibilities and needs and wants of every one.
However, I can read error messages.
In your case the message says "failed to install workflow". After a quick Google search I found that "workflow" actually is a Python package.
So the error is simply complaining that you haven't installed workflow and that its attempt to do so failed.
To fix this problem, either download workflow from this PyPI link (this is the latest version that Google showed me), or run pip install workflow. Either method should install the required package. Once it is installed, that particular error message should go away.
I hope this answer helped you!
Upvotes: 0