Darshan Mehta

Reputation: 30809

Google Cloud BigTable : Update the column value

I have the following BigTable structure for an example:

Table1 : column_family_1 : column_1 : value

Let's say the value here is a number. It is managed by a dataflow, and I want to update the value every time.

This value might be an amount, and I want to update it every time the user makes a purchase (to maintain the total spent to date), so I am doing the following in the purchase event listener dataflow whenever a purchase event is encountered:
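Roughly, the DoFn does a read-modify-write along these lines (a sketch using the HBase client API; the row key, the connection and purchaseAmount are placeholders):

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // 'connection' is an already-open HBase Connection; 'purchaseAmount' is this event's amount.
    Table table = connection.getTable(TableName.valueOf("Table1"));
    byte[] row = Bytes.toBytes("user#123");
    byte[] family = Bytes.toBytes("column_family_1");
    byte[] qualifier = Bytes.toBytes("column_1");

    // 1. Read the current total.
    Result result = table.get(new Get(row));
    byte[] current = result.getValue(family, qualifier);
    long total = (current == null) ? 0L : Bytes.toLong(current);

    // 2. Add the purchase amount and write the new total back.
    table.put(new Put(row).addColumn(family, qualifier, Bytes.toBytes(total + purchaseAmount)));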

Although this approach has some network latency, it seems to work. The scenario where it fails is when the dataflow has multiple workers, the user makes more than one purchase, and the events go to different workers: both workers read the same current value, add their own amount, and write back, so one update overwrites the other (e.g. both read 100, each adds 10, and both write 110 instead of 120).

To prevent this, I want to send a request which just says, in plain terms, "add 10 to the spent amount value". Is this something we can do in dataflow?

Upvotes: 0

Views: 6284

Answers (3)

Lstm168

Reputation: 1

Maybe you can try adding each transaction update as a separate column, using timestamps as qualifiers, so the total amount spent is simply the sum of all the columns. Periodically you can compact the N columns into one, and each individual write stays atomic.
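A sketch of that idea with the HBase client API (row key, family and amount are placeholders): each purchase is written as its own cell, and the total is obtained by summing all cells on read.

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    byte[] row = Bytes.toBytes("user#123");
    byte[] family = Bytes.toBytes("column_family_1");

    // Write: one column per transaction, qualified by the event timestamp.
    table.put(new Put(row)
        .addColumn(family, Bytes.toBytes(Long.toString(System.currentTimeMillis())),
                   Bytes.toBytes(10L)));

    // Read: sum every cell in the family to get the total spent so far.
    Result result = table.get(new Get(row).addFamily(family));
    long total = 0L;
    for (Cell cell : result.rawCells()) {
      total += Bytes.toLong(CellUtil.cloneValue(cell));
    }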

Upvotes: 0

Roberto Tena

Reputation: 61

Another solution could be the following:

  • Have one append-only table with the individual events, and an aggregated table with the relevant key and values to aggregate by.
  • Use AbstractCloudBigtableTableDoFn to do a Put insert into the append-only table.
  • In a subsequent stage of the same dataflow, window events over a reasonably short period (5 seconds?) and group them by the aggregation key (see the sketch at the end of this answer).
  • In the next stage(s) of that dataflow (or at this point it could even be routed to another dataflow, really), perform a range scan per grouped key in the append-only table, aggregate the values and do a Put insert into the aggregated table.

That way:

  • By ensuring that the events are already in Bigtable when a window is triggered after the AbstractCloudBigtableTableDoFn, a consistent aggregate can be obtained.
  • In the unlikely case that one window finishes after the next triggered window (and therefore writes an older view of the aggregation), the trigger timestamp can be used to tell apart the latest version of the cells.
  • The append-only table could also be used for other kinds/levels of aggregation you might decide to do afterwards.
  • The maintenance is relatively simple, as it would not need any periodic clean-up job.
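A minimal sketch of the windowing/grouping stage in Beam (assuming the events are already keyed by the aggregation key; ScanAndAggregateFn is a hypothetical DoFn that does the range scan, aggregation and Put into the aggregated table):

    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // events: key = aggregation key, value = purchase amount, already written to
    // the append-only table by the preceding AbstractCloudBigtableTableDoFn stage.
    PCollection<KV<String, Long>> events = ...;

    events
        // Group updates that arrive within a short fixed window (5 seconds here).
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardSeconds(5))))
        .apply(GroupByKey.<String, Long>create())
        // Hypothetical DoFn: range-scan the append-only table per key, aggregate,
        // and Put the result into the aggregated table.
        .apply(ParDo.of(new ScanAndAggregateFn()));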

Upvotes: 2

Solomon Duskis

Reputation: 2711

Bigtable has the capability to Increment values. You can see more details in the protobuf documentation.
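With the HBase client that Cloud Bigtable supports, an increment is a single atomic call along these lines (a sketch; the row and column names are placeholders):

    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Atomically adds 10 to the stored value and returns the new total.
    long newTotal = table.incrementColumnValue(
        Bytes.toBytes("user#123"),         // row key
        Bytes.toBytes("column_family_1"),  // column family
        Bytes.toBytes("column_1"),         // qualifier
        10L);                              // amount to add

Note that the stored value has to be an 8-byte big-endian long for the increment to work.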

Idempotency plays an important role in understanding counters in Bigtable. In Bigtable, the Puts are generally idempotent, which means that you can run them multiple times and always get the same result (a=2 will produce the same result no matter how many times you run it). Increments are not idempotent, since running them multiple times will produce different results (a++, a++ has a different result than a++, a++, a++).

Transient failures may or may not end up performing the Increment; during those transient errors it's never clear from the client side whether the Increment actually succeeded.

This Increment feature is complicated to support in Dataflow because of this idempotency. Dataflow has a concept of "bundles", which is a set of actions that acts as a unit of work. Those bundles are retried on transient failures (you can read more about Dataflow transient failure retries here). Dataflow treats that "bundle" as a unit, but Cloud Bigtable has to treat each individual item in the "bundle" as a distinct transaction, since Cloud Bigtable does not support multi-row transactions.

Given the mismatch in the expected behavior of "bundles", Cloud Bigtable will not allow you to run Increments via Dataflow.

The options you have deserve more documentation than I can provide here, but at a high level they are:

  1. Always use Put for any new event you find, and sum up the values on Reads. You can also write another job that does periodic clean-up of rows by creating a "transaction" that deletes all current values and writes a new cell with the sum.

  2. Use Cloud Functions which listen to Pub/Sub events and perform Increments. Here's a Cloud Bigtable example using Cloud Functions. You can also perform a Get, perform the addition and do a CheckAndMutate with the algorithm you describe in your post (I personally would opt for CheckAndMutate for consistency, if I were to choose this option; see the sketch at the end of this answer).

  3. Use AbstractCloudBigtableTableDoFn to write your own DoFn that performs Increments, or CheckAndMutate, but with the understanding that this may cause data integrity problems.

If the system is large enough, option #1 is your most robust option, but it comes at the cost of system complexity. If you don't want that complexity, option #2 is your next best bet (although I would opt for CheckAndMutate). If you don't care about having data integrity and need high throughput (like "page counts" or other telemetry where it's ok to be wrong a small fraction of the time), then option #3 is going to be your best bet.
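For reference, the Get-plus-CheckAndMutate variant from option #2 could look roughly like this (a sketch with the HBase client's checkAndPut; row, column names and amount are placeholders). The conditional write only succeeds if the cell still holds the value that was read, so a concurrent update forces a retry instead of being silently overwritten:

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    byte[] row = Bytes.toBytes("user#123");
    byte[] family = Bytes.toBytes("column_family_1");
    byte[] qualifier = Bytes.toBytes("column_1");

    boolean applied = false;
    while (!applied) {
      // Read the current total (null if the cell doesn't exist yet).
      Result current = table.get(new Get(row));
      byte[] expected = current.getValue(family, qualifier);
      long total = (expected == null) ? 10L : Bytes.toLong(expected) + 10L;

      Put put = new Put(row).addColumn(family, qualifier, Bytes.toBytes(total));
      // Applies the Put only if the cell still holds the value we just read.
      applied = table.checkAndPut(row, family, qualifier, expected, put);
    }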

Upvotes: 5
