I have an apache-beam==2.3.0 pipeline written with the Python SDK that works locally with the DirectRunner. When I switch the runner to DataflowRunner, I get an error saying that the name 'storage' is not defined globally.
Checking my code, I think it's because I'm relying on the credentials stored in my environment. In my Python code I just do:
class ExtractBlobs(beam.DoFn):
    def process(self, element):
        from google.cloud import storage
        client = storage.Client()
        yield list(client.get_bucket(element).list_blobs(max_results=100))
The real issue is that I need the client so I can get the bucket, and the bucket so I can list its blobs. Everything I'm doing here is just to list the blobs.
So if anyone can either point me in the right direction toward using storage.Client() in Dataflow, or show how to list the blobs of a GCS bucket without needing the client, I'd appreciate it.
Thanks in advance!
What I've read: https://cloud.google.com/docs/authentication/production#auth-cloud-implicit-python
Upvotes: 0
Views: 754
Fixed: Upon further reading and investigating, it turns out I have the required libraries installed locally, but Dataflow needs to know about them so it can install them on the workers it spins up. See https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
So all I've done is create a requirements.txt file containing my google-cloud-* dependencies.
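As a sketch, such a requirements.txt might look like the following (the exact package name and version pin are assumptions; list whatever your pipeline actually imports):

```
# Hypothetical contents of requirements.txt
google-cloud-storage==1.10.0
```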
I then spin up my pipeline like this:
python myPipeline.py --requirements_file requirements.txt --save_main_session True
The last flag tells the SDK to pickle the state of the __main__ module, so the imports you do in main are also available on the workers.
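For reference, the same two flags can also be set programmatically instead of on the command line. This is a minimal configuration sketch, assuming apache-beam is installed; the runner name and requirements.txt path mirror the example above and are not meant to run as-is:

```python
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# Equivalent of: --requirements_file requirements.txt --save_main_session True
options = PipelineOptions(runner='DataflowRunner')
setup = options.view_as(SetupOptions)
setup.requirements_file = 'requirements.txt'  # staged and installed on the workers
setup.save_main_session = True                # pickle __main__ state (incl. imports)

# Then pass `options` when constructing the pipeline:
# with beam.Pipeline(options=options) as p: ...
```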
Upvotes: 1