Reputation: 720
My Google Dataflow job runs locally but fails to build its package when the pipeline is run with the DataflowRunner. I am having this issue on apache-beam[gcp]==2.6.0; the same pipeline works on apache-beam[gcp]==2.4.0.
My code runs with the DirectRunner locally without any problem, and building the package with python setup.py sdist --formats=tar and installing it with pip install dist/my-package.tar both work as well.
The job fails with the error message:
Failed to install packages: failed to install workflow: exit status 1
This error is thrown after the following info logs, which seem to indicate that the system numpy in the Dataflow container is missing its METADATA file:
Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/usr/local/lib/python2.7/dist-packages/numpy-1.14.5.dist-info/METADATA'
Failed to report setup error to service: could not lease work item to report failure (no work items returned)
Based on the above numpy error I installed numpy 1.14.5, which fixed my issue. I am still unable to debug package setup properly, though, as the exact way Dataflow builds its containers is quite opaque.
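Concretely, the workaround was an explicit pin alongside the other requirements in setup.py (a minimal sketch; only the numpy line is new, and pinning it to the version named in the worker's error message is my assumption about why it works):

```python
# Minimal sketch of the workaround: pin numpy to the exact version named in
# the worker's error so pip does not try to modify the broken system copy.
REQUIRED_PACKAGES = [
    'numpy==1.14.5',  # matches the dist-info path in the worker error
    'apache-beam[gcp]==2.6.0',
    # ... remaining pins unchanged
]
```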
My issue is not with my setup.py, as otherwise the sdist build would not have worked. Dataflow's worker image build process also doesn't match dataflow.gcr.io/v1beta3/python:2.6.0, as that image has neither numpy nor Beam installed in it. Without reproducible Docker builds, debugging worker setup is difficult.
Some context around my workflow setup code: I install the neuralcoref library from https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_lg-3.0.0/en_coref_lg-3.0.0.tar.gz with custom commands, and the rest of my setup.py is:
...
REQUIRED_PACKAGES = [
    'six==1.12.0',
    'dill==0.2.9',
    'apache-beam[gcp]==2.6.0',
    'spacy==2.0.13',
    'requests==2.18.4',
    'unidecode==1.0.22',
    'tqdm==4.23.3',
    'lxml==4.2.1',
    'python-dateutil==2.7.3',
    'textblob==0.15.1',
    'networkx==2.1',
    'flashtext==2.7',
    'annoy==1.12.0',
    'ujson==1.35',
    'repoze.lru==0.7',
    'Whoosh==2.7.4',
    'python-Levenshtein==0.12.0',
    'fuzzywuzzy==0.16.0',
    'attrs==19.1.0',
    # 'scikit-learn==0.19.1',  # preinstalled in dataflow
    # 'pandas==0.23.0',  # preinstalled in dataflow
    # 'scipy==1.1.0',  # preinstalled in dataflow
]

setuptools.setup(
    name='myproject',
    version='0.0.6',
    description='my project',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        # Command class instantiated and run during pip install scenarios.
        'build': build,
        'CustomCommands': CustomCommands,
    }
)
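The custom-command part of my setup.py follows the shape of the Apache Beam juliaset example; a minimal sketch is below (the exact pip invocation and the class bodies here are simplified, not my verbatim code):

```python
# Sketch of the custom-command pattern used to install neuralcoref on the
# Dataflow workers, modeled on the Apache Beam juliaset example setup.py.
import setuptools  # imported first so its distutils shim is active
import subprocess
from distutils.command.build import build as _build

NEURALCOREF_URL = (
    'https://github.com/huggingface/neuralcoref-models/releases/download/'
    'en_coref_lg-3.0.0/en_coref_lg-3.0.0.tar.gz')

# Each entry is one command executed on the worker during package install.
CUSTOM_COMMANDS = [
    ['pip', 'install', NEURALCOREF_URL],
]


class build(_build):
    """Build command that also schedules the custom commands."""
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


class CustomCommands(setuptools.Command):
    """Runs each entry of CUSTOM_COMMANDS on the Dataflow worker."""
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            # check_call raises CalledProcessError on a non-zero exit,
            # which surfaces as a worker-startup failure in the job logs.
            subprocess.check_call(command)
```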
My local requirements.txt
is:
six==1.12.0
apache-beam[gcp]==2.6.0
spacy==2.0.13
requests==2.18.4
unidecode==1.0.22
tqdm==4.23.3
lxml==4.2.1
python-dateutil==2.7.3
textblob==0.15.1
networkx==2.1
flashtext==2.7
annoy==1.12.0
ujson==1.35
repoze.lru==0.7
Whoosh==2.7.4
python-Levenshtein==0.12.0
fuzzywuzzy==0.16.0
attrs==19.1.0
scikit-learn==0.19.1
pandas==0.23.0
scipy==1.1.0
The full error message is:
{
  insertId: "7107501484934866351:1025729:0:380041"
  jsonPayload: {
    line: "boot.go:145"
    message: "Failed to install packages: failed to install workflow: exit status 1"
  }
  labels: {
    compute.googleapis.com/resource_id: "7107501484934866351"
    compute.googleapis.com/resource_name: "myjob-04170525-av5b-harness-0w5w"
    compute.googleapis.com/resource_type: "instance"
    dataflow.googleapis.com/job_id: "2019-04-17_05_25_10-4738638106522967260"
    dataflow.googleapis.com/job_name: "myjob"
    dataflow.googleapis.com/region: "us-central1"
  }
  logName: "projects/myproject/logs/dataflow.googleapis.com%2Fworker-startup"
  receiveTimestamp: "2019-04-17T13:21:37.786576023Z"
  resource: {
    labels: {
      job_id: "2019-04-17_05_25_10-4738638106522967260"
      job_name: "myjob"
      project_id: "myproject"
      region: "us-central1"
      step_id: ""
    }
    type: "dataflow_step"
  }
  severity: "CRITICAL"
  timestamp: "2019-04-17T13:21:19.954714Z"
}
Upvotes: 1
Views: 1297
Reputation: 478
Are you trying to configure the version of Beam in your setup.py? I don't believe that will work: the Beam version on the Dataflow workers needs to match the version in the environment you are submitting the job from.
Each version of Beam has its own container on Dataflow. The Dataflow container for 2.6.0 can be pulled from dataflow.gcr.io/v1beta3/python:2.6.0. There are significant differences between 2.4.0 and 2.6.0, including the version of pip.
To help you debug further, please add a copy of your setup.py. It would also be useful to know which version of apache-beam is installed (from pip list).
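For example, a quick check via pkg_resources (a hedged sketch; run it in the same environment you submit the job from, and note the helper name is mine, not part of any API):

```python
# Report the package version visible to the submitting environment; for
# apache-beam it should match the worker image tag (2.6.0 for this job).
import pkg_resources


def installed_version(package):
    """Return the installed version of a package, or None if absent."""
    try:
        return pkg_resources.get_distribution(package).version
    except pkg_resources.DistributionNotFound:
        return None


print(installed_version('apache-beam'))
```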
Upvotes: 1