codebot

Reputation: 2646

Reading a file from Google Drive via Apache Beam

I'm trying to fetch a file from Google Drive using Apache Beam. I tried:

filenames = ['https://drive.google.com/file/d/<file_id>']
with beam.Pipeline() as pipeline:
    lines = (pipeline | beam.Create(filenames))
print(lines)

This returns a string like PCollection[[19]: Create/Map(decode).None]

I need to read a file from Google Drive and write it into a GCS bucket. How can I read a file from Google Drive with Apache Beam?

Upvotes: 0

Views: 531

Answers (2)

robertwb

Reputation: 5104

If you want to use Beam for this, you could write a function

def read_from_gdrive_and_yield_records(path):
    ...

and then use it like

import apache_beam as beam

filenames = ['https://drive.google.com/file/d/<file_id>']
with beam.Pipeline() as pipeline:
    paths = pipeline | beam.Create(filenames)
    # FlatMap emits every record the function yields for each path.
    records = paths | beam.FlatMap(read_from_gdrive_and_yield_records)
    records | beam.io.WriteToText('gs://...')

Though as mentioned, unless you have a lot of files, this may be overkill.
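
The body of read_from_gdrive_and_yield_records is left open above. One way it might be filled in is sketched below, under several assumptions that are not part of the answer: the file is UTF-8 text small enough to hold in memory, the workers have Application Default Credentials with read access to the file, and google-api-python-client is installed. The helpers extract_file_id and download_drive_file are hypothetical names introduced only for this illustration.

```python
import io
import re


def extract_file_id(url):
    """Pull the ID out of a 'https://drive.google.com/file/d/<file_id>' style URL."""
    match = re.search(r"/file/d/([^/?#]+)", url)
    if match is None:
        raise ValueError(f"No Drive file ID found in {url!r}")
    return match.group(1)


def download_drive_file(file_id):
    """Download a Drive file's bytes (assumes Application Default Credentials)."""
    # Imported inside the function so the packages are only needed where it runs.
    import google.auth
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaIoBaseDownload

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/drive.readonly"]
    )
    service = build("drive", "v3", credentials=credentials)
    request = service.files().get_media(fileId=file_id)

    buffer = io.BytesIO()
    downloader = MediaIoBaseDownload(buffer, request)
    done = False
    while not done:
        # next_chunk() returns (status, done); loop until the download finishes.
        _, done = downloader.next_chunk()
    return buffer.getvalue()


def read_from_gdrive_and_yield_records(path, download=download_drive_file):
    """Yield one record per line of the Drive file at `path`."""
    data = download(extract_file_id(path))
    for line in data.decode("utf-8").splitlines():
        yield line
```

Because the function is a generator, beam.FlatMap emits each yielded line as a separate element. The download parameter is only there so the parsing logic can be tried without network access, for example by passing download=lambda file_id: b"a\nb\n".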

Upvotes: 1

Mazlum Tosun

Reputation: 6572

If you don’t have complex transformations to apply, I think it’s better not to use Beam in this case.

  • Solution 1

You can instead use Google Colab (Jupyter notebooks hosted on Google servers), mount your Google Drive, and use the gcloud CLI to copy the files.

You can check the following links:

google-drive-to-gcs

stackoverflow-copy-file-from-google-drive-to-gcs

  • Solution 2

You can also use APIs to retrieve files from Google Drive and copy them to Cloud Storage.

For example, you can write a Python script using the Google Python client libraries and the following packages:

google-api-python-client 
google-auth-httplib2 
google-auth-oauthlib 
google-cloud-storage

This article shows an example.
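
As a rough illustration of Solution 2 (not the linked article's code), a script along these lines might do the copy. It assumes the file is small enough to hold in memory, that Application Default Credentials can read the Drive file and write to the bucket, and that "my-bucket" and "copied/file.txt" are placeholder names. copy_drive_file_to_gcs and main are hypothetical names chosen for this sketch.

```python
def copy_drive_file_to_gcs(drive_service, bucket, file_id, blob_name):
    """Download one Drive file into memory and upload it to a GCS bucket."""
    # For a get_media request, execute() returns the file's raw bytes.
    content = drive_service.files().get_media(fileId=file_id).execute()
    blob = bucket.blob(blob_name)
    blob.upload_from_string(content)
    return f"gs://{bucket.name}/{blob_name}"


def main():
    # Imported here so the copy function above can be tried with stand-in clients.
    import google.auth
    from google.cloud import storage
    from googleapiclient.discovery import build

    credentials, project = google.auth.default(
        scopes=[
            "https://www.googleapis.com/auth/drive.readonly",
            "https://www.googleapis.com/auth/cloud-platform",
        ]
    )
    drive_service = build("drive", "v3", credentials=credentials)
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.bucket("my-bucket")  # placeholder bucket name

    # "<file_id>" is the ID from the Drive URL; the blob name is a placeholder.
    print(copy_drive_file_to_gcs(drive_service, bucket, "<file_id>", "copied/file.txt"))
```

Calling main() runs the copy. For large files, a chunked download with MediaIoBaseDownload would likely be a better fit than reading everything into memory.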

Upvotes: 1
