Oleg Ivanytskyi

Reputation: 1091

How to get the list of files in the GCS Bucket using the Jupyter notebook in Dataproc?


I have recently started using GCP for my project and encountered difficulties when working with the bucket from the Jupyter notebook in the Dataproc cluster. At the moment I have a bucket with a bunch of files in it, and a Dataproc cluster with the Jupyter notebook. What I am trying to do is go over all the files in the bucket and extract the data from them to create a dataframe.

I can access one file at a time with the following code: data = spark.read.csv('gs://BUCKET_NAME/PATH/FILENAME.csv'), but there are hundreds of files, and I cannot write a line of code for each of them. Usually, I would do something like this:

import os
for filename in os.listdir(directory):
    ...  # process each local file

but this does not seem to work here. So, I was wondering, how do I iterate over files in a bucket using Jupyter notebook in the Dataproc cluster?

Would appreciate any help!

Upvotes: 3

Views: 3658

Answers (1)

Javier A

Reputation: 569

You can list the objects in your bucket with the following code:

from google.cloud import storage

# Create a client and point it at your bucket
client = storage.Client()
BUCKET_NAME = 'your_bucket_name'
bucket = client.get_bucket(BUCKET_NAME)

# list_blobs() returns every object in the bucket
blobs = bucket.list_blobs()
files = [blob.name for blob in blobs]

The list called files will contain the full object names. Note that GCS has no real folders: if your bucket uses folder-like prefixes, those prefixes are part of each name, and you can pass a prefix argument to list_blobs() to restrict the listing.
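To tie this back to the question, the listed names can be turned into gs:// paths and handed to Spark in one call. Here is a minimal sketch, assuming the objects are CSV files sharing a schema, that spark is the SparkSession already provided by the Dataproc Jupyter kernel, and that 'PATH/' and 'your_bucket_name' are placeholders:

from google.cloud import storage

BUCKET_NAME = 'your_bucket_name'
client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)

# Build full gs:// paths for the CSV objects under the placeholder prefix
csv_paths = [
    f'gs://{BUCKET_NAME}/{blob.name}'
    for blob in bucket.list_blobs(prefix='PATH/')
    if blob.name.endswith('.csv')
]

# spark.read.csv accepts a list of paths, so all files land in one DataFrame
data = spark.read.csv(csv_paths, header=True, inferSchema=True)

If all the files sit under one prefix, you can also skip the listing entirely and let Spark glob them, e.g. data = spark.read.csv('gs://BUCKET_NAME/PATH/*.csv').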

Upvotes: 7
