samaspin

Reputation: 2402

Suitable Google Cloud data storage option for raw JSON events with auto-incrementing id

I'm looking for an appropriate Google Cloud data/storage option to use as a location to stream raw JSON events into.

The events are generated by users in response to very large email broadcasts, so throughput could be very low one moment and up to ~25,000 events per second for short periods of time. The JSON representation of each event will probably be only around 1 KB.

I want to simply store these events as raw and unprocessed JSON strings, append-only, with a separate sequential numeric identifier for each record inserted. I'm planning to use this identifier as a way for consuming apps to be able to work through the stream sequentially (in a similar manner to the way Kafka consumers track their offset through the stream) - this will allow me to replay the event stream from points of my choosing.
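To make the consumer side concrete, this is roughly the access pattern I'm after (just a sketch; sqlite3 is only a stand-in for whichever backing store ends up being chosen, and the table and column names are made up):

```
# Illustrative only: sqlite3 stands in for the eventual backing store.
# Shows the Kafka-style offset pattern: each consumer remembers the last id
# it processed and asks for the next batch of rows above that id.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(json.dumps({"type": "open", "n": i}),) for i in range(10)],
)

def read_batch(last_offset, batch_size=5):
    """Return (rows, new_offset) for the rows just above last_offset."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (last_offset, batch_size),
    ).fetchall()
    new_offset = rows[-1][0] if rows else last_offset
    return rows, new_offset

offset = 0  # replay from the beginning; any previously stored offset works the same way
while True:
    batch, offset = read_batch(offset)
    if not batch:
        break
    for event_id, payload in batch:
        print(event_id, json.loads(payload))
```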

I am taking advantage of Google Cloud Logging to aggregate the event stream from Compute Engine nodes; from there I can stream directly into a BigQuery table or a Pub/Sub topic.

BigQuery seems more than capable of handling the streaming inserts; however, it seems to have no concept of auto-incrementing id columns, and the documentation suggests its query model is best suited for aggregate queries rather than narrow result sets. My requirement to query for the next-highest row would clearly go against this.
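For reference, the streaming insert itself looks simple enough; a minimal sketch with the google-cloud-bigquery Python client (project, dataset, table and schema are placeholders, and note that any ordering column has to be generated client-side because there is no auto-increment):

```
# Minimal sketch of streaming inserts with the google-cloud-bigquery client.
# "my-project.events_dataset.raw_events" and the column names are placeholders.
import datetime
import uuid

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.events_dataset.raw_events"  # placeholder

rows = [
    {
        # No auto-increment: the ordering value must be supplied by the client.
        "insert_ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "payload": '{"type": "open", "recipient": "user@example.com"}',
    }
]

# row_ids gives BigQuery a best-effort dedup key if an insert is retried.
errors = client.insert_rows_json(table_id, rows,
                                 row_ids=[str(uuid.uuid4()) for _ in rows])
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```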

The best idea I currently have is to push into Pub/Sub and have a subscriber write each event into a Cloud SQL database; that way Pub/Sub could buffer the events if Cloud SQL is unable to keep up. My desire for an auto-incrementing identifier and possibly a datestamp column makes this feel like a 'tabular' use case, so I'm also feeling the NoSQL options might be inappropriate.
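Roughly what I'm imagining for that subscriber is something like the following sketch (the project, subscription, table and connection details are placeholders, and the choice of PyMySQL against a MySQL-flavoured Cloud SQL instance is just an assumption for illustration):

```
# Rough sketch: drain a Pub/Sub subscription into a Cloud SQL (MySQL) table
# whose AUTO_INCREMENT primary key provides the sequential identifier.
#
# Assumed table:
#   CREATE TABLE raw_events (
#     id BIGINT AUTO_INCREMENT PRIMARY KEY,
#     received_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
#     payload JSON NOT NULL
#   );
import pymysql
from google.cloud import pubsub_v1

conn = pymysql.connect(host="10.0.0.5", user="events", password="...",
                       database="events", autocommit=True)

def handle(message):
    # NB: callbacks arrive on multiple threads; a real consumer would use a
    # connection pool rather than a single shared connection.
    with conn.cursor() as cur:
        cur.execute("INSERT INTO raw_events (payload) VALUES (%s)",
                    (message.data.decode("utf-8"),))
    message.ack()  # only ack once the row is safely committed

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "raw-events-sub")
future = subscriber.subscribe(subscription, callback=handle)
future.result()  # block; Pub/Sub buffers whenever the inserts fall behind
```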

If anybody has a better suggestion I would love to get some input.

Upvotes: 0

Views: 791

Answers (1)

Ramesh Dharan

Reputation: 895

We know that many customers have had success using BigQuery for this purpose, but it requires some work to choose the appropriate identifiers if you want to supply your own. It's not clear to me from your description why you couldn't just use a timestamp as the identifier and use the ingestion-time partitioned table streaming ingestion option:

https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_ingestion-time_partitioned_tables
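A rough sketch of what the replay query could then look like against an ingestion-time partitioned table (the dataset/table names and the event_ts column are placeholders; this assumes you also stream a client-side event timestamp so rows within a partition have a total order):

```
# Sketch: replay events from a chosen point using _PARTITIONTIME plus a
# streamed event timestamp instead of an auto-incrementing id.
from datetime import datetime, timezone

from google.cloud import bigquery

client = bigquery.Client()

# Consumers track the timestamp of the last event they processed instead of a
# numeric offset, and replay by asking for everything after that point.
last_seen = datetime(2016, 7, 1, tzinfo=timezone.utc)

query = """
SELECT event_ts, payload
FROM `my-project.events_dataset.raw_events`
WHERE _PARTITIONTIME >= TIMESTAMP(@start_day)
  AND event_ts > @last_seen
ORDER BY event_ts
LIMIT 10000
"""

job = client.query(query, job_config=bigquery.QueryJobConfig(query_parameters=[
    bigquery.ScalarQueryParameter("start_day", "DATE", last_seen.date()),
    bigquery.ScalarQueryParameter("last_seen", "TIMESTAMP", last_seen),
]))
for row in job.result():
    print(row.event_ts, row.payload)
```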

As far as Cloud Bigtable, as noted by Les in the comments:

Cloud Bigtable could definitely keep up, but isn't really designed for sequential adds with a sequential key as that creates hotspotting.

See https://cloud.google.com/bigtable/docs/schema-design-time-series#design_your_row_key_with_your_queries_in_mind

You could again use a timestamp as a key here, although you would want to do some work to e.g. add a hash or other uniquifier in order to ensure that at your 25k writes/second peak you don't overwhelm a single node (we can generally handle about 10k row modifications per second per node, and if you just use lexicographically sequential IDs like an incrementing number, all your writes would be going to the same server).
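For illustration, a sketch of what that kind of key salting could look like with the Cloud Bigtable Python client (the project, instance, table and column-family names are placeholders, and 32 buckets is an arbitrary choice):

```
# Sketch: spread sequential writes across Bigtable nodes by prefixing the row
# key with a small hash-derived salt, so keys are not lexicographically sequential.
import datetime
import hashlib
import json

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("events-instance").table("raw_events")

def write_event(event: dict) -> None:
    payload = json.dumps(event).encode("utf-8")
    ts = datetime.datetime.now(datetime.timezone.utc)
    # Salt = first byte of a hash of the payload, mapped onto 32 buckets.
    salt = int(hashlib.md5(payload).hexdigest()[:2], 16) % 32
    row_key = f"{salt:02d}#{ts.isoformat()}".encode("utf-8")
    row = table.direct_row(row_key)
    row.set_cell("events", "payload", payload, timestamp=ts)
    row.commit()

write_event({"type": "open", "recipient": "user@example.com"})
```

The trade-off is that a consumer reading back in time order then has to scan all 32 salt prefixes and merge the results, which is the usual pattern described in the schema-design document above.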

At any rate it does seem like BigQuery is probably what you want to use. You could also refer to this blog post for an example of event tracking via BigQuery: https://medium.com/streak-developer-blog/using-google-bigquery-for-event-tracking-23316e187cbd

Upvotes: 1
