games_bond

Reputation: 41

When to use the BigtableIO.write() API vs the CloudBigtableIO.CloudBigtableSingleTableBufferedWriteFn API

I see two APIs available for writing to GCP Cloud Bigtable: BigtableIO.write() and CloudBigtableIO.CloudBigtableSingleTableBufferedWriteFn. I am working on a Dataflow pipeline that reads data from one Bigtable table and stores the processed output in another. The processing load is large, around 10 TB.

I would like to know which API is better suited for writing a large dataset to Bigtable.

Upvotes: 1

Views: 423

Answers (2)

Yi Hu

Reputation: 96

Answer as of Aug 2023 (from an internal Q&A, shared here for better visibility)

Technically, they are two different implementations:

  • Apache Beam's BigtableIO uses the Cloud Bigtable protos to represent read results and mutations.

  • CloudBigtableIO uses HBase Result and Put objects.

Since both writes are done via batched mutations, there is no preference for one over the other for high-volume writes to a single table. It should not matter, and you can keep using your existing implementation. Both are used quite heavily and are actively maintained.
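For concreteness, here is a minimal sketch of the two write paths (the project, instance, and table IDs are placeholders, and pipeline setup is elided). Beam's BigtableIO consumes KV<ByteString, Iterable<Mutation>> pairs of proto mutations keyed by row key, while CloudBigtableIO consumes HBase Mutation objects such as Put:

    import com.google.bigtable.v2.Mutation;
    import com.google.cloud.bigtable.beam.CloudBigtableIO;
    import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;
    import com.google.protobuf.ByteString;
    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WritePaths {

      // Beam's BigtableIO: elements are KV<row key, Iterable<proto Mutation>>.
      static void writeWithBigtableIO(
          PCollection<KV<ByteString, Iterable<Mutation>>> rows) {
        rows.apply(
            BigtableIO.write()
                .withProjectId("PROJECT_ID")
                .withInstanceId("INSTANCE_ID")
                .withTableId("TABLE_ID"));
      }

      // One proto Mutation that sets a single cell.
      static Mutation setCell(String family, String qualifier, String value) {
        return Mutation.newBuilder()
            .setSetCell(
                Mutation.SetCell.newBuilder()
                    .setFamilyName(family)
                    .setColumnQualifier(ByteString.copyFromUtf8(qualifier))
                    .setTimestampMicros(System.currentTimeMillis() * 1000)
                    .setValue(ByteString.copyFromUtf8(value)))
            .build();
      }

      // CloudBigtableIO: elements are HBase Mutations, typically Puts.
      static void writeWithCloudBigtableIO(
          PCollection<org.apache.hadoop.hbase.client.Mutation> puts) {
        CloudBigtableTableConfiguration config =
            new CloudBigtableTableConfiguration.Builder()
                .withProjectId("PROJECT_ID")
                .withInstanceId("INSTANCE_ID")
                .withTableId("TABLE_ID")
                .build();
        puts.apply(CloudBigtableIO.writeToTable(config));
      }

      // One HBase Put that sets a single cell.
      static Put put(String rowKey, String family, String qualifier, String value) {
        return new Put(Bytes.toBytes(rowKey))
            .addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier),
                Bytes.toBytes(value));
      }
    }

Either way the connector batches the mutations before sending them to the table, which is why neither has a throughput advantage for bulk writes.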

There is a long-term goal to update Beam's HBaseIO to be compatible (most importantly, scalable) with Bigtable's HBase adapter. Some recent updates to HBaseIO (around Beam 2.46.0 onwards) reflect this effort.

Upvotes: 0

Catherine O

Reputation: 1011

Based on your use case, it is highly recommended to use the CloudBigtableIO API, since you're writing a Dataflow pipeline. The BigtableIO class comes from the Apache Beam SDK, while the CloudBigtableIO class comes from Google's Cloud Bigtable HBase client for Beam (the bigtable-hbase-beam artifact).

Some of CloudBigtableIO's advantages include the following (a short sketch follows the list):

  • Well documented HBase API
  • More efficient when reading really large tables
  • More efficient for Pub/Sub as a source
  • Easy creation of custom DoFns
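For the question's read-process-write shape, a minimal CloudBigtableIO sketch might look like the following. The project, instance, and table IDs, the column family and qualifier, and the pass-through DoFn body are all placeholders for your real processing:

    import com.google.cloud.bigtable.beam.CloudBigtableIO;
    import com.google.cloud.bigtable.beam.CloudBigtableScanConfiguration;
    import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.Read;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.hadoop.hbase.client.Mutation;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TableToTable {
      public static void main(String[] args) {
        // Source and destination tables; all IDs are placeholders.
        CloudBigtableScanConfiguration readConfig =
            new CloudBigtableScanConfiguration.Builder()
                .withProjectId("PROJECT_ID")
                .withInstanceId("INSTANCE_ID")
                .withTableId("SOURCE_TABLE")
                .build();
        CloudBigtableTableConfiguration writeConfig =
            new CloudBigtableTableConfiguration.Builder()
                .withProjectId("PROJECT_ID")
                .withInstanceId("INSTANCE_ID")
                .withTableId("DEST_TABLE")
                .build();

        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply("ReadSource", Read.from(CloudBigtableIO.read(readConfig)))
            .apply("Process", ParDo.of(new DoFn<Result, Mutation>() {
              @ProcessElement
              public void processElement(@Element Result row,
                  OutputReceiver<Mutation> out) {
                // Placeholder processing: copy one cell from the source row.
                byte[] value = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
                if (value != null) {
                  Put put = new Put(row.getRow());
                  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), value);
                  out.output(put);
                }
              }
            }))
            .apply("WriteDest", CloudBigtableIO.writeToTable(writeConfig));
        p.run();
      }
    }

Note that HBase's Put extends Mutation, which is the element type CloudBigtableIO.writeToTable expects.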

Upvotes: 1
