Reputation: 2646
I'm trying to fetch file from Google Drive using Apache Beam. I tried,
filenames = ['https://drive.google.com/file/d/<file_id>']
with beam.Pipeline() as pipeline:
lines = (pipeline | beam.Create(filenames))
print(lines)
This returns a string like PCollection[[19]: Create/Map(decode).None]
I need to read a file from Google Drive and write it into GCS bucket. How can I read a file form G Drive from Apache beam?
Upvotes: 0
Views: 531
Reputation: 5104
If you want to use Beam for this, you would could write a function
def read_from_gdrive_and_yield_records(path):
...
and then use it like
filenames = ['https://drive.google.com/file/d/<file_id>']
with beam.Pipeline() as pipeline:
paths = pipeline | beam.Create(filenames)
records = paths | beam.FlatMap(read_from_gdrive_and_emit_records)
records | beam.io.WriteToText('gs://...')
Though as mentioned, unless you have a lot of files, this may be overkill.
Upvotes: 1
Reputation: 6572
If you don’t have complex transformations to apply, I thinks it’s better to not use Beam
in this case.
You can instead use Google Collab
(Juypiter Notebook on Google servers), mount your gDrive and use the gCloud CLI to copy files.
You can check the following links :
stackoverflow-copy-file-from-google-drive-to-gcs
You can also use APIs to retrieve files from Google Drive
and copy them to Cloud Storage
.
You can for example develop a Python
script using Python
Google clients and the following packages :
google-api-python-client
google-auth-httplib2
google-auth-oauthlib
google-cloud-storage
This article shows an example.
Upvotes: 1