user9773014

Dataflow: storage.Client() import error or how to list all blobs of a GCP bucket

I have an apache-beam==2.3.0 pipeline written using the Python SDK that works locally with the DirectRunner. When I change the runner to DataflowRunner I get an error about 'storage' not being defined globally.

Checking my code, I think it's because I am using the credentials stored in my environment. In my Python code I just do:

    class ExtractBlobs(beam.DoFn):
        def process(self, element):
            # Import inside process() so it is resolved on the worker at run time
            from google.cloud import storage
            # Uses the application default credentials from the environment
            client = storage.Client()
            # element is a bucket name; emit the first 100 blobs as a list
            yield list(client.get_bucket(element).list_blobs(max_results=100))
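For context, a DoFn like this gets wired into the pipeline roughly as below; this is only a minimal sketch, and the bucket name and step labels are placeholders, not my actual pipeline:

    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | 'BucketNames' >> beam.Create(['my-example-bucket'])  # placeholder bucket
         | 'ListBlobs' >> beam.ParDo(ExtractBlobs()))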

The real issue is that I need the client to get the bucket, and the bucket to list the blobs. Everything I'm doing here is just so I can list the blobs.

So can anyone either point me in the right direction for using 'storage.Client()' in Dataflow, or show how to list the blobs of a GCP bucket without needing the client?

Thanks in advance!

What I've read: https://cloud.google.com/docs/authentication/production#auth-cloud-implicit-python

Upvotes: 0

Views: 754

Answers (1)

user9773014

Fixed: Okay, so upon further reading and investigating it turns out I have the required libraries to run my pipeline locally, but Dataflow also needs to know about them so it can download them onto the workers it spins up. https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/

So all I've done is create a requirements.txt file with my google-cloud-* requirements.
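For example, a minimal requirements.txt might look like this; the exact package list and version pin are assumptions, so pin whatever you actually use locally:

    google-cloud-storage==1.10.0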

I then spin up my pipeline like this:

    python myPipeline.py --requirements_file requirements.txt --save_main_session True 

That last flag tells Dataflow to pickle the main session, so the imports and globals you define in main are available on the workers.
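Equivalently, these options can be set in code via PipelineOptions. A rough sketch, where the project ID and GCS path are placeholders you'd replace with your own:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project-id',               # placeholder
        temp_location='gs://my-bucket/tmp',    # placeholder
        requirements_file='requirements.txt',
    )
    # Pickle the main session so imports/globals from __main__ reach the workers
    options.view_as(SetupOptions).save_main_session = True

    with beam.Pipeline(options=options) as p:
        ...  # pipeline steps go here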

Upvotes: 1
