Reputation: 16505
We are planning to use BigQuery for analytical purposes for our inventory system. Since this is inventory data, the record for a given storeid-productid combination is going to change very often. In terms of volume, the total number of store-product records is somewhere between 200M and 400M, and we expect about 500K mutations per day in total. The mutations arrive on Kafka topics.
From a cost standpoint, what is the optimal solution? The options are:
Option 1: A Kafka listener issues one DML statement per mutation:
UPDATE inventory SET quantity = ? WHERE productid = ? AND storeid = ?
My assessment of this option: it is the simplest of all, but it may incur higher cost because BigQuery doesn't have a notion of a primary key. Will a search index or clustering help? (A rough sketch of such a listener follows the MERGE example below.)
Option 2: Have a staging table where we store every mutation, then periodically update the main/reporting table using MERGE, something like this: https://cloud.google.com/blog/products/bigquery/performing-large-scale-mutations-in-bigquery (however, this is a 2018 article and things might have changed a lot; for example, I think the 3-hour lag mentioned there is now 30 minutes).
MERGE dataset.Inventory T
USING dataset.inventory_staging S
ON T.ProductID = S.ProductID AND T.storeid = S.storeid
WHEN MATCHED THEN
  UPDATE SET quantity = S.quantity
WHEN NOT MATCHED THEN
  INSERT (ProductID, storeid, quantity) VALUES (S.ProductID, S.storeid, S.quantity)
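For reference, here is roughly what I mean by the option 1 listener: a minimal sketch assuming a Python consumer built with kafka-python and the google-cloud-bigquery client. The topic name, table name, message format, and INT64 column types are placeholders of my own, not anything prescribed by BigQuery.

# Hypothetical sketch of option 1: one parameterized DML UPDATE per Kafka message.
# Assumes kafka-python and google-cloud-bigquery are installed and that messages
# are JSON objects like {"productid": 1, "storeid": 2, "quantity": 3}.
import json

from google.cloud import bigquery
from kafka import KafkaConsumer

client = bigquery.Client()
consumer = KafkaConsumer(
    "inventory-mutations",                    # assumed topic name
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

UPDATE_SQL = """
UPDATE `my_project.dataset.Inventory`
SET quantity = @quantity
WHERE productid = @productid AND storeid = @storeid
"""

for message in consumer:
    m = message.value
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("quantity", "INT64", m["quantity"]),
            bigquery.ScalarQueryParameter("productid", "INT64", m["productid"]),
            bigquery.ScalarQueryParameter("storeid", "INT64", m["storeid"]),
        ]
    )
    # Each call is a separate DML job; with no primary key, BigQuery has to scan
    # the (clustered) table for every statement, which is why I worry about cost.
    client.query(UPDATE_SQL, job_config=job_config).result()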
Now the second question: if we take the second approach, what is the most cost-effective way to sink a Kafka topic into BigQuery? Does Kafka -> GCS -> BQ give any advantage over streaming solutions (like a boilerplate Kafka listener that uses the Storage Write API: https://cloud.google.com/bigquery/docs/write-api#write-api-overview)?
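To make the Kafka -> GCS -> BQ variant concrete, the BigQuery leg I have in mind is a plain batch load job into the staging table; a minimal sketch, assuming the Kafka side already writes newline-delimited JSON files to a bucket (bucket path and table name below are placeholders). As I understand it, batch load jobs are not billed for ingestion, while the streaming insert / Storage Write API paths are billed per GB ingested, which is where the cost advantage of this path would come from.

# Hypothetical sketch of the GCS -> BigQuery leg: load newline-delimited JSON
# mutation files from a bucket into the staging table with a batch load job.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Assumed bucket layout; the Kafka -> GCS step (e.g. a GCS sink connector)
# would produce these files.
uri = "gs://my-inventory-bucket/mutations/*.json"

load_job = client.load_table_from_uri(
    uri,
    "my_project.dataset.inventory_staging",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish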
Upvotes: 2
Views: 788
Reputation: 670
For the first question: single UPDATE statements are a no-go, and scheduled MERGEs have some trade-offs in data availability, so another option is to store all the mutations and version the data so that you can always recover the latest value per store-product key.
Then you can schedule a query that deletes or archives old versions by removing them from the source table; this reduces the amount of data your queries have to scan.
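A minimal sketch of what that versioned pattern could look like, assuming every appended mutation row carries an updated_at timestamp; the project, table, and column names are placeholders, and the cleanup statement is just one simple way to prune old versions.

# Hypothetical sketch of the append-and-version pattern: read the latest version
# per (productid, storeid), then periodically prune superseded rows.
from google.cloud import bigquery

client = bigquery.Client()

# Latest value per key, assuming each mutation row has an updated_at timestamp.
LATEST_SQL = """
SELECT productid, storeid, quantity
FROM `my_project.dataset.inventory_mutations`
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY productid, storeid
  ORDER BY updated_at DESC
) = 1
"""

# Scheduled cleanup: keep only the newest row per key. Note this rewrites the
# table; in a real setup you would re-specify partitioning/clustering in the DDL.
CLEANUP_SQL = """
CREATE OR REPLACE TABLE `my_project.dataset.inventory_mutations` AS
SELECT *
FROM `my_project.dataset.inventory_mutations`
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY productid, storeid
  ORDER BY updated_at DESC
) = 1
"""

latest = client.query(LATEST_SQL).result()   # current quantity per store-product key
client.query(CLEANUP_SQL).result()           # run this on a schedule, not per read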
Regarding your second question, the fastest way to solve this would be:
Kafka -> Pub/Sub (via the Pub/Sub Kafka connector) -> BigQuery (via a BigQuery subscription).
Just be aware of the limitations and the specific transformations that happen at each step:
https://cloud.google.com/pubsub/docs/connect_kafka#convert-to-pubsub
https://cloud.google.com/pubsub/docs/bigquery
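For example, creating the BigQuery subscription with the Python Pub/Sub client could look like the sketch below; the project, topic, subscription, and table IDs are placeholders, and the Kafka-to-Pub/Sub bridge from the first link is assumed to already be publishing to the topic.

# Hypothetical sketch: create a Pub/Sub BigQuery subscription so messages on the
# (Kafka-bridged) topic are written straight into a BigQuery table.
from google.cloud import pubsub_v1

project_id = "my_project"                      # placeholder
topic_id = "inventory-mutations"               # placeholder, fed by the Kafka connector
subscription_id = "inventory-mutations-to-bq"  # placeholder
bigquery_table = "my_project.dataset.inventory_staging"

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path(project_id, topic_id)
subscription_path = subscriber.subscription_path(project_id, subscription_id)

bigquery_config = pubsub_v1.types.BigQueryConfig(
    table=bigquery_table,
    write_metadata=True,  # also write message metadata columns to each row
)

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "bigquery_config": bigquery_config,
        }
    )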
Upvotes: 0
Reputation: 6552
Running one UPDATE statement per item would be extremely expensive; you need the staging table and periodic MERGEs.
Kafka -> GCS -> BQ is the most cost-effective way.
As an additional suggestion, you could explore replacing Kafka with a Pub/Sub topic; Pub/Sub also has direct ingestion into BigQuery.
Upvotes: 2