mongolol

Reputation: 971

Iterate through Files in Google Cloud Bucket

I am attempting to implement a relatively simple ETL pipeline that iterates through the files in a Google Cloud Storage bucket. The bucket has two folders: /input and /output.

What I'm trying to do is write a Java/Scala script that iterates through the files in /input and applies the transformation to those that are either not present in /output or have a timestamp later than their counterpart in /output. I've been looking through the Java API docs for a function I can leverage (as opposed to shelling out to `gsutil ls ...`), but haven't had any luck so far. Any recommendations on where to look in the docs?

Edit: there is a better way to do this than using data transfer objects:

  public Page<Blob> listBlobs() {
    // [START listBlobs]
    // `bucket` is a com.google.cloud.storage.Bucket instance
    Page<Blob> blobs = bucket.list();
    for (Blob blob : blobs.iterateAll()) {
      // do something with the blob
    }
    // [END listBlobs]
    return blobs;
  }
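To illustrate the "not present in /output, or newer than /output" rule described above, here is a minimal sketch of the comparison step. It assumes the object names and update times (epoch millis, e.g. from `Blob.getUpdateTime()`) have already been collected into maps; the class and method names are hypothetical, not part of any API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class StaleCheck {
  /**
   * Returns the input object names that still need processing: those
   * absent from the output map, or with a newer update time than the
   * corresponding output object.
   */
  public static List<String> needsProcessing(Map<String, Long> input,
                                             Map<String, Long> output) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, Long> e : input.entrySet()) {
      Long outTime = output.get(e.getKey());
      if (outTime == null || e.getValue() > outTime) {
        result.add(e.getKey());
      }
    }
    Collections.sort(result);
    return result;
  }

  public static void main(String[] args) {
    Map<String, Long> in = Map.of("a.csv", 100L, "b.csv", 200L, "c.csv", 50L);
    Map<String, Long> out = Map.of("a.csv", 150L, "c.csv", 40L);
    // b.csv is missing from output; c.csv is newer in input
    System.out.println(needsProcessing(in, out)); // [b.csv, c.csv]
  }
}
```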

Old method:

  def getBucketFolderContents(
      bucketName: String
  ): HttpResponse = {
    val credential = getCredential
    val httpTransport = GoogleNetHttpTransport.newTrustedTransport()
    val requestFactory = httpTransport.createRequestFactory(credential)
    // List objects under the "raw/" folder via the JSON API's prefix parameter
    val uri = "https://www.googleapis.com/storage/v1/b/" +
      URLEncoder.encode(bucketName, "UTF-8") +
      "/o?prefix=raw%2F"
    val url = new GenericUrl(uri)
    // buildGetRequest takes a GenericUrl, not a raw String
    val request = requestFactory.buildGetRequest(url)
    request.execute()
  }
}

Upvotes: 0

Views: 5101

Answers (1)

Mike Schwartz

Reputation: 12145

You can list objects under a folder by setting the prefix string on the object-listing API: https://cloud.google.com/storage/docs/json_api/v1/objects/list. The listing results are sorted, so you should be able to list both folders, walk through the two result sets in order, and generate the diff list.
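The walk-in-order step can be sketched as a two-pointer merge over the two sorted name lists. This is only an illustration of the approach, assuming the names have already been fetched (e.g. via `storage.list(bucket, Storage.BlobListOption.prefix("input/"))` with the folder prefixes stripped); the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SortedDiff {
  /**
   * Walks two sorted name lists in lockstep and returns the names
   * present in `inputs` but absent from `outputs`.
   */
  public static List<String> missingFromOutput(List<String> inputs,
                                               List<String> outputs) {
    List<String> diff = new ArrayList<>();
    int i = 0, j = 0;
    while (i < inputs.size()) {
      if (j >= outputs.size() || inputs.get(i).compareTo(outputs.get(j)) < 0) {
        diff.add(inputs.get(i++)); // input name not matched by any output
      } else if (inputs.get(i).equals(outputs.get(j))) {
        i++; j++;                  // present in both lists; skip
      } else {
        j++;                       // output-only name; advance outputs
      }
    }
    return diff;
  }

  public static void main(String[] args) {
    List<String> in = List.of("a.csv", "b.csv", "d.csv");
    List<String> out = List.of("a.csv", "c.csv");
    System.out.println(missingFromOutput(in, out)); // [b.csv, d.csv]
  }
}
```

Because both lists are already sorted by the API, this runs in a single linear pass rather than requiring a lookup set.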

Upvotes: 1
