games_bond

Reputation: 41

When to use the BigtableIO.write() API vs the CloudBigtableIO.CloudBigtableSingleTableBufferedWriteFn API

I see two APIs available for writing to GCP Cloud Bigtable: BigtableIO.write() and CloudBigtableIO.CloudBigtableSingleTableBufferedWriteFn. I am working on a Dataflow pipeline that reads data from one Bigtable table and stores the processed output in another. The processing load is large, around 10 TB.

I would like to know which API is better suited for writing a large dataset to Bigtable.

Upvotes: 1

Views: 423

Answers (2)

Yi Hu

Reputation: 96

Answer as of Aug 2023 (from an internal Q&A, shared here for better visibility)

Technically, they are two different implementations:

  • Apache Beam's BigtableIO uses the Cloud Bigtable protos to represent read results and mutations.

  • CloudBigtableIO uses HBase Result and Put objects.

Since both writes are done via batched mutations, there is no preference for one over the other for high-volume writes to a single table. It should not matter, and you can keep using your existing implementation. Both are used quite heavily and are actively maintained.
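For concreteness, here is a minimal sketch of the two write paths (the project, instance, and table IDs are placeholders, and pipeline setup is elided). Beam's BigtableIO consumes KV<ByteString, Iterable<Mutation>> pairs of proto mutations keyed by row key, while CloudBigtableIO consumes HBase Mutation objects such as Put:

    import com.google.bigtable.v2.Mutation;
    import com.google.cloud.bigtable.beam.CloudBigtableIO;
    import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;
    import com.google.protobuf.ByteString;
    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WritePaths {

      // Beam's BigtableIO: elements are KV<row key, Iterable<proto Mutation>>.
      static void writeWithBigtableIO(
          PCollection<KV<ByteString, Iterable<Mutation>>> rows) {
        rows.apply(
            BigtableIO.write()
                .withProjectId("PROJECT_ID")
                .withInstanceId("INSTANCE_ID")
                .withTableId("TABLE_ID"));
      }

      // One proto Mutation that sets a single cell.
      static Mutation setCell(String family, String qualifier, String value) {
        return Mutation.newBuilder()
            .setSetCell(
                Mutation.SetCell.newBuilder()
                    .setFamilyName(family)
                    .setColumnQualifier(ByteString.copyFromUtf8(qualifier))
                    .setTimestampMicros(System.currentTimeMillis() * 1000)
                    .setValue(ByteString.copyFromUtf8(value)))
            .build();
      }

      // CloudBigtableIO: elements are HBase Mutations, typically Puts.
      static void writeWithCloudBigtableIO(
          PCollection<org.apache.hadoop.hbase.client.Mutation> puts) {
        CloudBigtableTableConfiguration config =
            new CloudBigtableTableConfiguration.Builder()
                .withProjectId("PROJECT_ID")
                .withInstanceId("INSTANCE_ID")
                .withTableId("TABLE_ID")
                .build();
        puts.apply(CloudBigtableIO.writeToTable(config));
      }

      // One HBase Put that sets a single cell.
      static Put put(String rowKey, String family, String qualifier, String value) {
        return new Put(Bytes.toBytes(rowKey))
            .addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier),
                Bytes.toBytes(value));
      }
    }

Either way the connector batches the mutations before sending them to the table, which is why neither has a throughput advantage for bulk writes.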

There is a long-term goal to update Beam's HBaseIO to be compatible (most importantly, scalable) with Bigtable's HBase adapter. Some recent updates to HBaseIO (around Beam 2.46.0 onwards) reflect this effort.

Upvotes: 0

Catherine O

Reputation: 1011

Based on your use case, it is highly recommended to use the CloudBigtableIO API, since you're writing a Dataflow pipeline. The BigtableIO class comes from the Apache Beam SDK, while the CloudBigtableIO class comes from Google's Cloud Bigtable HBase client for Beam (the bigtable-hbase-beam artifact).

Some of CloudBigtableIO's advantages include the following (a short sketch follows the list):

  • Well documented HBase API
  • More efficient when reading really large tables
  • More efficient for Pub/Sub as a source
  • Easy creation of custom DoFns
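For the question's read-process-write shape, a minimal CloudBigtableIO sketch might look like the following. The project, instance, and table IDs, the column family and qualifier, and the pass-through DoFn body are all placeholders for your real processing:

    import com.google.cloud.bigtable.beam.CloudBigtableIO;
    import com.google.cloud.bigtable.beam.CloudBigtableScanConfiguration;
    import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.Read;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.hadoop.hbase.client.Mutation;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TableToTable {
      public static void main(String[] args) {
        // Source and destination tables; all IDs are placeholders.
        CloudBigtableScanConfiguration readConfig =
            new CloudBigtableScanConfiguration.Builder()
                .withProjectId("PROJECT_ID")
                .withInstanceId("INSTANCE_ID")
                .withTableId("SOURCE_TABLE")
                .build();
        CloudBigtableTableConfiguration writeConfig =
            new CloudBigtableTableConfiguration.Builder()
                .withProjectId("PROJECT_ID")
                .withInstanceId("INSTANCE_ID")
                .withTableId("DEST_TABLE")
                .build();

        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply("ReadSource", Read.from(CloudBigtableIO.read(readConfig)))
            .apply("Process", ParDo.of(new DoFn<Result, Mutation>() {
              @ProcessElement
              public void processElement(@Element Result row,
                  OutputReceiver<Mutation> out) {
                // Placeholder processing: copy one cell from the source row.
                byte[] value = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
                if (value != null) {
                  Put put = new Put(row.getRow());
                  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), value);
                  out.output(put);
                }
              }
            }))
            .apply("WriteDest", CloudBigtableIO.writeToTable(writeConfig));
        p.run();
      }
    }

Note that HBase's Put extends Mutation, which is the element type CloudBigtableIO.writeToTable expects.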

Upvotes: 1
