Reputation: 436
I'm trying to run the official Dataflow example here: https://github.com/GoogleCloudPlatform/dataflow-prediction-example
However, the Dataflow job fails to start correctly (and the same error is happening with my other jobs too), due to the following errors in the logs:
(happens 1st) Successfully built tensorflow-module
(happens 2nd) Could not install packages due to an EnvironmentError:
[Errno 2] No such file or directory: '/usr/local/lib/python2.7/dist-packages/tensorflow-1.9.0.dist-info/METADATA'
I followed the directions on GitHub exactly, and here is the output of pip freeze in the virtualenv for this example:
absl-py==0.4.0
apache-beam==2.6.0
astor==0.7.1
avro==1.8.2
backports.weakref==1.0.post1
cachetools==2.1.0
certifi==2018.8.13
chardet==3.0.4
crcmod==1.7
dill==0.2.8.2
docopt==0.6.2
enum34==1.1.6
fasteners==0.14.1
funcsigs==1.0.2
future==0.16.0
futures==3.2.0
gapic-google-cloud-pubsub-v1==0.15.4
gast==0.2.0
google-apitools==0.5.20
google-auth==1.5.1
google-auth-httplib2==0.0.3
google-cloud-bigquery==0.25.0
google-cloud-core==0.25.0
google-cloud-pubsub==0.26.0
google-gax==0.15.16
googleapis-common-protos==1.5.3
googledatastore==7.0.1
grpc-google-iam-v1==0.11.4
grpcio==1.14.1
hdfs==2.1.0
httplib2==0.11.3
idna==2.7
Markdown==2.6.11
mock==2.0.0
monotonic==1.5
numpy==1.14.5
oauth2client==4.1.2
pbr==4.2.0
ply==3.8
proto-google-cloud-datastore-v1==0.90.4
proto-google-cloud-pubsub-v1==0.15.4
protobuf==3.6.1
pyasn1==0.4.4
pyasn1-modules==0.2.2
pydot==1.2.4
pyparsing==2.2.0
pytz==2018.4
PyVCF==0.6.8
PyYAML==3.13
requests==2.19.1
rsa==3.4.2
six==1.11.0
tensorboard==1.10.0
tensorflow==1.10.0
termcolor==1.1.0
typing==3.6.4
urllib3==1.23
Werkzeug==0.14.1
This pip dependency issue happened for all the other jobs that I tried, so I decided to try the official GitHub example, and it's happening for this one too.
The job ID is 2018-08-15_23_42_57-394561747688459326, and I'm using Python 2.7.
Thanks for the help, and any pointers!
Upvotes: 1
Views: 2518
Reputation: 8178
As explained in the Apache Beam documentation on how to handle Python dependencies in a pipeline, the recommended approach for PyPI dependencies is to create a requirements.txt
file and then pass it as an optional command-line option like below (which may be where things went wrong when you hit this issue):
--requirements_file requirements.txt
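Equivalently, the same option can be set when constructing the pipeline in Python. Here is a minimal sketch (the project and bucket names are placeholders, not values from your job):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/bucket values; requirements_file is the relevant option here.
options = PipelineOptions(
    runner='DataflowRunner',
    project='your-project',
    temp_location='gs://your-bucket/tmp',
    requirements_file='requirements.txt',  # installed on each worker via pip
)

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create(['check worker deps'])
With this in place, Dataflow installs the listed packages on every worker before running your code.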
In any case, as can be seen in the latest sample on how to run Apache Beam with TensorFlow, what the code actually does is pass the list of packages to be installed via the install_requires option in the setuptools setup.py
file, so this is also an option you can follow, and I see it solved your issue.
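A rough sketch of what that setup.py looks like (the package name and pinned version below are only illustrative, not copied from the sample):
import setuptools

setuptools.setup(
    name='dataflow-prediction-example',  # illustrative package name
    version='0.0.1',
    packages=setuptools.find_packages(),
    # Everything listed here is installed on the Dataflow workers at startup.
    install_requires=[
        'tensorflow==1.10.0',  # illustrative pin
    ],
)
The pipeline is then launched with --setup_file ./setup.py rather than --requirements_file.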
Upvotes: 2
Reputation: 436
I actually got around this issue by removing my requirements.txt
file and listing the very few additional libraries that my app was using in my setup.py
file (discarding the dependencies already provided on the Dataflow workers - https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies#version-250_1).
Nevertheless, I'm not exactly sure this is the right solution, since the GitHub example itself only worked once I removed the pip install tensorflow
command from its setup.py
file.
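For reference, this is roughly what my setup.py ended up looking like (the job name and extra dependency below are placeholders):
import setuptools

setuptools.setup(
    name='my-dataflow-job',  # placeholder name
    version='0.0.1',
    packages=setuptools.find_packages(),
    # Only the few libraries NOT already preinstalled on the Dataflow workers;
    # tensorflow is intentionally left out, and the custom `pip install tensorflow`
    # command from the example's setup.py is removed entirely.
    install_requires=[
        'some-extra-library==1.0.0',  # placeholder dependency
    ],
)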
Hope this helps someone! :)
Upvotes: 1