Reputation: 971
I am attempting to implement a relatively simple ETL pipeline that iterates through files in a Google Cloud Storage bucket. The bucket has two folders: /input and /output.
What I'm trying to do is write a Java/Scala script that iterates through the files in /input and applies the transformation to those that are either not present in /output or have a timestamp later than their counterpart in /output. I've been looking through the Java API docs for a function I can leverage (as opposed to just shelling out to gsutil ls ...), but haven't had any luck so far. Any recommendations on where to look in the docs?
Edit: There is a better way to do this than using data transfer objects; the Cloud Storage client library can list blobs directly:
import com.google.api.gax.paging.Page;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Bucket;

public Page<Blob> listBlobs(Bucket bucket) {
  // Lists every blob in the bucket, transparently fetching further pages.
  Page<Blob> blobs = bucket.list();
  for (Blob blob : blobs.iterateAll()) {
    // do something with the blob
  }
  return blobs;
}
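For reference, a minimal sketch of obtaining the bucket handle used above, assuming application-default credentials; "my-bucket" is a placeholder name:
import com.google.cloud.storage.Bucket;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

// Client authenticated via application-default credentials.
Storage storage = StorageOptions.getDefaultInstance().getService();
// "my-bucket" is a placeholder; get() returns null if the bucket does not exist.
Bucket bucket = storage.get("my-bucket");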
Old method:
import java.net.URLEncoder
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport
import com.google.api.client.http.GenericUrl

def getBucketFolderContents(bucketName: String) = {
  val credential = getCredential
  val httpTransport = GoogleNetHttpTransport.newTrustedTransport()
  val requestFactory = httpTransport.createRequestFactory(credential)
  // Objects under a "folder" are listed via the prefix query parameter;
  // note the slash between the encoded bucket name and the /o resource.
  val uri = "https://www.googleapis.com/storage/v1/b/" +
    URLEncoder.encode(bucketName, "UTF-8") +
    "/o?prefix=" + URLEncoder.encode("raw/", "UTF-8")
  // buildGetRequest expects a GenericUrl, not the raw string.
  val request = requestFactory.buildGetRequest(new GenericUrl(uri))
  request.execute()
}
Upvotes: 0
Views: 5101
Reputation: 12145
You can list the objects under a folder by setting the prefix parameter on the objects list API: https://cloud.google.com/storage/docs/json_api/v1/objects/list. Listing results are sorted lexicographically by name, so you can list both folders, walk through the two result sets in order, and generate the diff list; see the sketch below.
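The prefix listing is also exposed through the google-cloud-storage Java client. A minimal sketch of the diff the question describes, with the bucket name as a placeholder and a HashMap index of /output standing in for the in-order merge walk (update times compared in milliseconds):
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.Storage.BlobListOption;
import com.google.cloud.storage.StorageOptions;

import java.util.HashMap;
import java.util.Map;

public class DiffFolders {
  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    String bucket = "my-bucket"; // placeholder bucket name

    // Index /output objects by their path relative to the folder.
    Map<String, Long> outputUpdated = new HashMap<>();
    for (Blob blob : storage.list(bucket, BlobListOption.prefix("output/")).iterateAll()) {
      String relative = blob.getName().substring("output/".length());
      if (!relative.isEmpty()) { // skip the zero-byte folder marker, if present
        outputUpdated.put(relative, blob.getUpdateTime());
      }
    }

    // An /input object needs the transformation if it is absent from
    // /output or has been updated more recently than its counterpart.
    for (Blob blob : storage.list(bucket, BlobListOption.prefix("input/")).iterateAll()) {
      String relative = blob.getName().substring("input/".length());
      if (relative.isEmpty()) continue;
      Long outTime = outputUpdated.get(relative);
      if (outTime == null || blob.getUpdateTime() > outTime) {
        System.out.println("needs transform: " + blob.getName());
      }
    }
  }
}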
Upvotes: 1