Reputation: 119
I am writing a consumer that consumes user activity data (activityid, userid, timestamp, cta, duration)
from Google Pub/Sub, and I want to create a sink for it so that I can train my ML model in an online fashion.
Since this sink is the source from which I will fetch a user's last x (say 100) activities to update the ML model, storing the data in user-sharded form (in, say, a NoSQL DB such as Bigtable) would make retrieval easy, but the update operation would be costly: I would have to append to the stored value every time an activity event arrives for that user. Which type of sink should I consider in this situation?
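For concreteness, here is a minimal sketch of the naive read-modify-write append I am worried about, using the google-cloud-bigtable Python client; the project, instance, table, and column names are placeholders, not an existing setup:

```python
import json
from google.cloud import bigtable

# Placeholder names, assumed for illustration only.
client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("user-activity")

def append_activity(user_id: str, event: dict) -> None:
    """Naive pattern: read the stored list, append, write it back."""
    row_key = f"user#{user_id}".encode()
    row = table.read_row(row_key)
    history = []
    if row is not None:
        # The whole activity history lives in a single cell value.
        history = json.loads(row.cells["activity"][b"events"][0].value)
    history = (history + [event])[-100:]  # keep only the last 100 activities
    write_row = table.direct_row(row_key)
    write_row.set_cell("activity", b"events", json.dumps(history).encode())
    write_row.commit()
```

Every event costs a full read plus a full rewrite of the user's history, which is the overhead I would like to avoid.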
Upvotes: 1
Views: 63
Reputation: 119
I ended up using Bigtable cell versions, with a garbage-collection policy set to keep the last 100 cell versions; when re-training/updating the ML model, I iterate over the historical cell versions.
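A minimal sketch of this approach with the google-cloud-bigtable Python client; the project, instance, table, and column names are placeholders:

```python
import datetime
import json
from google.cloud import bigtable
from google.cloud.bigtable import column_family, row_filters

# Placeholder names, assumed for illustration only.
client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")
table = instance.table("user-activity")

# One-time setup: a column family whose GC policy keeps only the
# last 100 cell versions, so Bigtable prunes old activities for us.
table.create(column_families={"activity": column_family.MaxVersionsGCRule(100)})

def write_activity(user_id: str, event: dict) -> None:
    """Each event becomes a new cell version under the same column:
    a blind write, with no read-modify-write of the history."""
    row = table.direct_row(f"user#{user_id}".encode())
    row.set_cell(
        "activity",
        b"event",
        json.dumps(event).encode(),
        timestamp=datetime.datetime.now(datetime.timezone.utc),
    )
    row.commit()

def last_activities(user_id: str) -> list[dict]:
    """Fetch up to the 100 most recent cell versions for re-training."""
    row = table.read_row(
        f"user#{user_id}".encode(),
        filter_=row_filters.CellsColumnLimitFilter(100),
    )
    if row is None:
        return []
    cells = row.cells["activity"][b"event"]  # newest version first
    return [json.loads(cell.value) for cell in cells]
```

Note that Bigtable applies garbage collection lazily (at compaction time), so reads can briefly see more than 100 versions; the `CellsColumnLimitFilter` at read time guarantees at most 100 regardless.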
I will update this answer with the final read/write throughput and latency numbers.
Upvotes: 1