Reputation: 41
I see two APIs available for writing to GCP Bigtable: BigtableIO.write() vs CloudBigtableIO.CloudBigtableSingleTableBufferedWriteFn. I am working on a Dataflow pipeline that reads data from one Bigtable and stores the processed output in another Bigtable. The processing load is large, around 10 TB.
I would like to know which API to use, given the requirement of writing a large dataset to Bigtable.
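For context, here is roughly the shape of my pipeline using the Beam BigtableIO flavor. This is a sketch, not my real code: the project, instance, and table IDs are placeholders, and the processing step is a stand-in that just emits one cell per row.

import com.google.bigtable.v2.Mutation;
import com.google.bigtable.v2.Row;
import com.google.protobuf.ByteString;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.IterableCoder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.extensions.protobuf.ByteStringCoder;
import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class BigtableToBigtable {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadSource", BigtableIO.read()
            .withProjectId("my-project")      // placeholder
            .withInstanceId("my-instance")    // placeholder
            .withTableId("source-table"))     // placeholder
     .apply("Process", ParDo.of(new DoFn<Row, KV<ByteString, Iterable<Mutation>>>() {
        @ProcessElement
        public void process(@Element Row row,
            OutputReceiver<KV<ByteString, Iterable<Mutation>>> out) {
          // Stand-in for the real ~10 TB processing step: one cell per input row.
          Mutation m = Mutation.newBuilder()
              .setSetCell(Mutation.SetCell.newBuilder()
                  .setFamilyName("cf")
                  .setColumnQualifier(ByteString.copyFromUtf8("col"))
                  .setValue(ByteString.copyFromUtf8("processed")))
              .build();
          out.output(KV.<ByteString, Iterable<Mutation>>of(
              row.getKey(), Collections.singletonList(m)));
        }
      }))
     // Set the element coder explicitly so coder inference cannot fail.
     .setCoder(KvCoder.of(ByteStringCoder.of(),
         IterableCoder.of(ProtoCoder.of(Mutation.class))))
     .apply("WriteSink", BigtableIO.write()
            .withProjectId("my-project")
            .withInstanceId("my-instance")
            .withTableId("sink-table"));

    p.run();
  }
}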
Upvotes: 1
Views: 423
Reputation: 96
Answer as of Aug 2023 (from an internal Q&A, shared here for better visibility).
Technically they are two different implementations. Apache Beam's BigtableIO uses the Cloud Bigtable protos for read results and mutations, while CloudBigtableIO uses HBase Result and Put objects.
Since both perform writes via batched mutations, neither is preferred over the other for high-volume writes to a single table. It should not matter, and you can keep using whichever implementation you already have. Both are used quite heavily and are actively maintained.
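To make the distinction concrete, here is a sketch of what a single write element looks like under each API. The row key, column family, qualifier, and value are placeholders; both shapes end up batched into Bigtable mutation requests by their respective connectors.

import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import java.util.Collections;
import org.apache.beam.sdk.values.KV;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteElementShapes {
  public static void main(String[] args) {
    // Beam's BigtableIO.write() consumes KV<ByteString, Iterable<Mutation>>,
    // where Mutation is the Cloud Bigtable v2 proto.
    KV<ByteString, Iterable<Mutation>> protoElement =
        KV.<ByteString, Iterable<Mutation>>of(
            ByteString.copyFromUtf8("row-key"),
            Collections.singletonList(
                Mutation.newBuilder()
                    .setSetCell(Mutation.SetCell.newBuilder()
                        .setFamilyName("cf")
                        .setColumnQualifier(ByteString.copyFromUtf8("col"))
                        .setValue(ByteString.copyFromUtf8("value")))
                    .build()));

    // CloudBigtableIO consumes HBase Mutations (typically Put) instead.
    Put hbaseElement = new Put(Bytes.toBytes("row-key"))
        .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
  }
}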
There is a long-term goal to make Beam's HBaseIO compatible (most importantly, scalable) with Bigtable's HBase adapter. Some recent updates to HBaseIO (around Beam 2.46.0 onwards) reflect this effort.
Upvotes: 0
Reputation: 1011
Based on your use case, it is highly recommended to use the CloudBigtableIO API since you're writing a Dataflow pipeline. The BigtableIO class comes from the Apache Beam SDK, while the CloudBigtableIO class is provided by Google (in the bigtable-hbase-beam client library).
Some of CloudBigtableIO's advantages include:
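For reference, a minimal write sketch in the CloudBigtableIO style, patterned after Google's bigtable-hbase-beam Hello World example. The project, instance, and table IDs, the row keys, and the cell contents below are placeholders.

import com.google.cloud.bigtable.beam.CloudBigtableIO;
import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CloudBigtableWrite {
  public static void main(String[] args) {
    // Connection details for the destination table; batching and connection
    // management are handled by the connector.
    CloudBigtableTableConfiguration config =
        new CloudBigtableTableConfiguration.Builder()
            .withProjectId("my-project")    // placeholder
            .withInstanceId("my-instance")  // placeholder
            .withTableId("sink-table")      // placeholder
            .build();

    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply("Keys", Create.of("row-1", "row-2"))
     .apply("ToPuts", ParDo.of(new DoFn<String, Mutation>() {
        @ProcessElement
        public void process(@Element String key, OutputReceiver<Mutation> out) {
          // Turn each key into an HBase Put, the element type CloudBigtableIO expects.
          out.output(new Put(Bytes.toBytes(key))
              .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"),
                  Bytes.toBytes("value")));
        }
      }))
     .apply("WriteToBigtable", CloudBigtableIO.writeToTable(config));
    p.run().waitUntilFinish();
  }
}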
Upvotes: 1