user1414202

Reputation: 436

Google Cloud Dataflow (Python) - Not installing dependencies correctly

I'm trying to run the official Dataflow example here: https://github.com/GoogleCloudPlatform/dataflow-prediction-example

However, the Dataflow job fails to start correctly (and the same error is happening with my other jobs too), due to the following error in the logs:

    Successfully built tensorflow-module    (appears first)
    Could not install packages due to an EnvironmentError:
    [Errno 2] No such file or directory: '/usr/local/lib/python2.7/dist-packages/tensorflow-1.9.0.dist-info/METADATA'    (appears second)

I followed the directions on GitHub exactly, and here is the output of pip freeze from the virtualenv for this example:

    absl-py==0.4.0
    apache-beam==2.6.0
    astor==0.7.1
    avro==1.8.2
    backports.weakref==1.0.post1
    cachetools==2.1.0
    certifi==2018.8.13
    chardet==3.0.4
    crcmod==1.7
    dill==0.2.8.2
    docopt==0.6.2
    enum34==1.1.6
    fasteners==0.14.1
    funcsigs==1.0.2
    future==0.16.0
    futures==3.2.0
    gapic-google-cloud-pubsub-v1==0.15.4
    gast==0.2.0
    google-apitools==0.5.20
    google-auth==1.5.1
    google-auth-httplib2==0.0.3
    google-cloud-bigquery==0.25.0
    google-cloud-core==0.25.0
    google-cloud-pubsub==0.26.0
    google-gax==0.15.16
    googleapis-common-protos==1.5.3
    googledatastore==7.0.1
    grpc-google-iam-v1==0.11.4
    grpcio==1.14.1
    hdfs==2.1.0
    httplib2==0.11.3
    idna==2.7
    Markdown==2.6.11
    mock==2.0.0
    monotonic==1.5
    numpy==1.14.5
    oauth2client==4.1.2
    pbr==4.2.0
    ply==3.8
    proto-google-cloud-datastore-v1==0.90.4
    proto-google-cloud-pubsub-v1==0.15.4
    protobuf==3.6.1
    pyasn1==0.4.4
    pyasn1-modules==0.2.2
    pydot==1.2.4
    pyparsing==2.2.0
    pytz==2018.4
    PyVCF==0.6.8
    PyYAML==3.13
    requests==2.19.1
    rsa==3.4.2
    six==1.11.0
    tensorboard==1.10.0
    tensorflow==1.10.0
    termcolor==1.1.0
    typing==3.6.4
    urllib3==1.23
    Werkzeug==0.14.1

This pip dependency issue happened with all the other jobs I tried, so I decided to run the official GitHub example, and it happens with that one too.

The job ID is 2018-08-15_23_42_57-394561747688459326, and I'm using Python 2.7.

Thanks for the help, and any pointers!

Upvotes: 1

Views: 2518

Answers (2)

dsesto

Reputation: 8178

As explained in the Apache Beam documentation on handling Python dependencies in a pipeline, the recommended approach for PyPI dependencies is to create a requirements.txt file and pass it via the optional command-line flag below (missing this may have been the mistake when you ran into the issue):

    --requirements_file requirements.txt
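
For illustration, a launch command using that flag might look like the following sketch (the script name, project, and bucket are placeholders, not taken from the original example):

    # Placeholders: my_pipeline.py, my-gcp-project, my-bucket
    python my_pipeline.py \
      --runner DataflowRunner \
      --project my-gcp-project \
      --temp_location gs://my-bucket/tmp \
      --requirements_file requirements.txt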

In any case, as can be seen in the latest sample on how to run Apache Beam with TensorFlow, what that code actually does is pass the list of packages to install via the install_requires option in setuptools, so this is also an option you can follow, and I see it solved your issue.
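
A minimal sketch of that approach (the package name here is hypothetical, and the TensorFlow pin is taken from your pip freeze output, not from the sample):

    # setup.py -- minimal sketch; name and version pin are assumptions
    import setuptools

    setuptools.setup(
        name='dataflow-prediction-example',  # hypothetical package name
        version='0.1',
        # Anything listed here is installed on each Dataflow worker at startup
        install_requires=['tensorflow==1.10.0'],
        packages=setuptools.find_packages(),
    )

You then point Beam at this file with the --setup_file ./setup.py pipeline option instead of --requirements_file.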

Upvotes: 2

user1414202

Reputation: 436

I actually got around to solving this issue by removing my requirements.txt file and instead listing the very few additional libraries my app uses in my setup.py file (leaving out the dependencies already provided on the Dataflow workers: https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies#version-250_1).
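
To illustrate, the job is then launched with --setup_file rather than --requirements_file (the script, project, and bucket names below are placeholders):

    # Placeholders: my_pipeline.py, my-gcp-project, my-bucket
    python my_pipeline.py \
      --runner DataflowRunner \
      --project my-gcp-project \
      --temp_location gs://my-bucket/tmp \
      --setup_file ./setup.py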

Nevertheless, I'm not entirely sure this is the right solution, since the GitHub example itself only worked once I removed the pip install tensorflow command from its setup.py file.

Hope this helps someone! :)

Upvotes: 1
