Oleg Ivanytskyi

Reputation: 1091

How to get the list of files in the GCS Bucket using the Jupyter notebook in Dataproc?


I have recently started using GCP for my project and encountered difficulties when working with the bucket from the Jupyter notebook in the Dataproc cluster. At the moment I have a bucket with a bunch of files in it, and a Dataproc cluster with the Jupyter notebook. What I am trying to do is go over all the files in the bucket and extract the data from them to create a dataframe.

I can access one file at a time with the following code: data = spark.read.csv('gs://BUCKET_NAME/PATH/FILENAME.csv'), but there are hundreds of files, and I cannot write a line of code for each of them. Usually, I would do something like this:

import os
for filename in os.listdir(directory):
    ...  # process each local file

but this does not seem to work here. So, I was wondering, how do I iterate over files in a bucket using Jupyter notebook in the Dataproc cluster?

Would appreciate any help!

Upvotes: 3

Views: 3658

Answers (1)

Javier A

Reputation: 569

You can list the objects in your bucket with the following code:

from google.cloud import storage

# Create a client and point it at your bucket
client = storage.Client()
BUCKET_NAME = 'your_bucket_name'
bucket = client.get_bucket(BUCKET_NAME)

# list_blobs() returns every object in the bucket
blobs = bucket.list_blobs()
files = [blob.name for blob in blobs]

The list called files will contain the full object names. Note that GCS has no real folders: if your bucket uses folder-like prefixes, those prefixes are part of each name, and you can pass a prefix argument to list_blobs() to restrict the listing.
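To tie this back to the question, the listed names can be turned into gs:// paths and handed to Spark in one call. Here is a minimal sketch, assuming the objects are CSV files sharing a schema, that spark is the SparkSession already provided by the Dataproc Jupyter kernel, and that 'PATH/' and 'your_bucket_name' are placeholders:

from google.cloud import storage

BUCKET_NAME = 'your_bucket_name'
client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)

# Build full gs:// paths for the CSV objects under the placeholder prefix
csv_paths = [
    f'gs://{BUCKET_NAME}/{blob.name}'
    for blob in bucket.list_blobs(prefix='PATH/')
    if blob.name.endswith('.csv')
]

# spark.read.csv accepts a list of paths, so all files land in one DataFrame
data = spark.read.csv(csv_paths, header=True, inferSchema=True)

If all the files sit under one prefix, you can also skip the listing entirely and let Spark glob them, e.g. data = spark.read.csv('gs://BUCKET_NAME/PATH/*.csv').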

Upvotes: 7
