Reputation: 13587
So, the Pub/Sub message size limit is 10MB.
I've been using Pub/Sub as both an input and an output for data pipelines because of its low latency. The assumption here is that Pub/Sub is the fastest mechanism on Google Cloud to pull data into a Compute Engine instance and push it back out one (or a few) data points at a time (not in a batch manner). A Cloud Function with a Pub/Sub push subscription then writes the output to BigQuery.
99% of the data I process does not exceed 1MB, but there are some outliers larger than 10MB.
What can I do about it? Leverage some kind of compression? Write the output to Cloud Storage instead of Pub/Sub? Maybe to a persistent SSD? I want to make sure my compute instances are doing their job of digesting one data point at a time and spitting out the output, with minimal time spent on pulling and pushing data and maximum time spent on transforming it.
Upvotes: 6
Views: 9662
Reputation: 75990
The safest and most scalable way is to save the data to Cloud Storage and publish only the file reference in Pub/Sub, not the content. It's also the most cost-efficient approach.
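If it helps, here is a minimal sketch of that pattern in Python, assuming hypothetical project, bucket, and topic names and the google-cloud-storage and google-cloud-pubsub client libraries; your subscriber (e.g. the Cloud Function) would read the gs:// URI from the message and fetch the object itself:

```python
from google.cloud import pubsub_v1, storage

# Hypothetical names -- replace with your own project, bucket, and topic.
PROJECT_ID = "my-project"
BUCKET_NAME = "my-pipeline-payloads"
TOPIC_ID = "pipeline-output"

storage_client = storage.Client()
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)


def publish_large_payload(object_name: str, payload: bytes) -> None:
    """Upload the payload to Cloud Storage, then publish only its gs:// URI."""
    blob = storage_client.bucket(BUCKET_NAME).blob(object_name)
    blob.upload_from_string(payload, content_type="application/octet-stream")

    uri = f"gs://{BUCKET_NAME}/{object_name}"
    # The Pub/Sub message stays tiny: it carries only the reference, not the data.
    future = publisher.publish(topic_path, data=uri.encode("utf-8"))
    future.result()  # block until the publish is confirmed


publish_large_payload("outputs/point-000123.json", b'{"some": "large output"}')
```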
You can also consider compressing the data, if it compresses well. That could be faster than going through Cloud Storage, but it's not as scalable.
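A rough sketch of the compression option, again with hypothetical names, gzipping the payload before publishing and falling back to Cloud Storage when it still doesn't fit; the subscriber would call gzip.decompress(message.data):

```python
import gzip
import json

from google.cloud import pubsub_v1

PUBSUB_LIMIT = 10 * 1024 * 1024  # 10 MB hard limit per message

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "pipeline-output")  # hypothetical names


def publish_compressed(record: dict) -> None:
    """Gzip the record and publish it inline; callers should fall back to
    the Cloud Storage approach if the compressed payload is still too big."""
    compressed = gzip.compress(json.dumps(record).encode("utf-8"))
    if len(compressed) > PUBSUB_LIMIT:
        raise ValueError("Still over 10MB after compression, use Cloud Storage instead")
    # The attribute tells subscribers they must gzip.decompress(message.data).
    publisher.publish(topic_path, data=compressed, content_encoding="gzip").result()
```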
Upvotes: 12