Reputation: 5726
I have a background service that produces files in Google Cloud Storage. Once it is done, it generates a file in the output folder.
In my flow I need to get the list of these files and start a Dataproc Spark job with that list. The processing is not real-time and takes tens of minutes.
GCS has a notification system that can publish notifications to the Pub/Sub service.
In GCS a file .../feature/***/***.done will be created to mark the completion of the service job. Once the file is created, a notification is sent to the Pub/Sub service.
I believe I can write a Cloud Function that reads this notification, by some magic gets the location of the modified file, and lists all files from that folder. It would then publish another message to Pub/Sub with all the required information, roughly as in the sketch below.
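A rough sketch of what I imagine, assuming a 1st-gen Python Cloud Function with a Pub/Sub trigger; the project and downstream topic names (`my-project`, `file-lists`) are placeholders:

```python
import json
import os

from google.cloud import pubsub_v1, storage

# Placeholder names -- adjust for the real project/topic.
PROJECT_ID = os.environ.get("GCP_PROJECT", "my-project")
OUTPUT_TOPIC = "file-lists"


def on_done_file(event, context):
    """Pub/Sub-triggered Cloud Function (1st gen).

    GCS notifications carry the bucket and object name as message
    attributes, so no parsing of the payload is needed.
    """
    attrs = event.get("attributes", {})
    bucket_name = attrs["bucketId"]
    object_name = attrs["objectId"]  # e.g. "feature/123/job.done"

    # Only react to the completion marker.
    if not object_name.endswith(".done"):
        return

    # List every file in the same folder as the .done marker.
    folder = object_name.rsplit("/", 1)[0] + "/"
    client = storage.Client()
    files = [
        f"gs://{bucket_name}/{blob.name}"
        for blob in client.list_blobs(bucket_name, prefix=folder)
        if not blob.name.endswith(".done")
    ]

    # Publish the file list downstream for the Spark job to pick up.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, OUTPUT_TOPIC)
    publisher.publish(topic_path, json.dumps({"files": files}).encode("utf-8"))
```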
Ideally, it would be great to use jobs instead of streaming to reduce costs. This may mean that Pub/Sub initiates the job, rather than a streaming job pulling new messages from Pub/Sub.
Upvotes: 2
Views: 1580
Reputation: 146
Question 1: Can I subscribe to new files in GCS by wildcard?
You can set up GCS notifications to filter by path prefix; see the -p option of gsutil notification create. Cloud Pub/Sub also has a filter-by-attribute API, currently in beta. You can use it to filter on the attributes set by GCS; the filtering language supports exact matches and prefix checks on those attributes.
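As a sketch, both pieces can also be set up with the Python client libraries (bucket, topic, and subscription names here are hypothetical):

```python
from google.cloud import pubsub_v1, storage

PROJECT_ID = "my-project"       # hypothetical names throughout
BUCKET = "my-bucket"
TOPIC = "gcs-notifications"
SUBSCRIPTION = "feature-files-only"

# GCS side: only send notifications for objects under feature/,
# equivalent to `gsutil notification create -p feature/ ...`.
bucket = storage.Client().bucket(BUCKET)
notification = bucket.notification(
    topic_name=TOPIC,
    blob_name_prefix="feature/",
    payload_format="JSON_API_V1",
)
notification.create()

# Pub/Sub side: a subscription that only delivers messages whose
# objectId attribute starts with "feature/" (attribute filtering).
subscriber = pubsub_v1.SubscriberClient()
subscriber.create_subscription(
    request={
        "name": subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION),
        "topic": f"projects/{PROJECT_ID}/topics/{TOPIC}",
        "filter": 'hasPrefix(attributes.objectId, "feature/")',
    }
)
```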
The message published to the Cloud Pub/Sub topic will have attributes giving you the bucket and name of the object, so you should be able to easily list the other files under that bucket/subpath.
Question 2: Is it possible to start a DataProc job by Pub/Sub notification?
Yes, you can set up a Cloud Function that subscribes to your Cloud Pub/Sub topic. The function can then start a Dataproc job using the Dataproc client library, or perform any other action.
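A minimal sketch of such a function, assuming an existing cluster named my-cluster and a Spark job packaged at a hypothetical gs:// jar path; the file list arrives in the message body, as published by the function in the question:

```python
import base64
import json

from google.cloud import dataproc_v1

PROJECT_ID = "my-project"   # hypothetical names throughout
REGION = "us-central1"
CLUSTER = "my-cluster"      # assumes the cluster already exists


def start_spark_job(event, context):
    """Pub/Sub-triggered Cloud Function that submits a Dataproc Spark job."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    files = payload["files"]

    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )
    job = {
        "placement": {"cluster_name": CLUSTER},
        "spark_job": {
            "main_class": "com.example.Processor",  # hypothetical
            "jar_file_uris": ["gs://my-bucket/jobs/processor.jar"],
            "args": files,  # pass the file list to the Spark job
        },
    }
    job_client.submit_job(project_id=PROJECT_ID, region=REGION, job=job)
```

Submitting to an existing cluster keeps the function short-lived; if cost favors ephemeral clusters, the function could instead create one per run with dataproc_v1.ClusterControllerClient and tear it down afterward.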
Upvotes: 1