Reputation: 4153
The article Example: Hadoop Map/Reduce Job with Cloud Bigtable creates a Hadoop cluster on Google Cloud and connects it to a Cloud Bigtable cluster.
Even in that article, a Connection
object is used to communicate with the Bigtable cluster. Does this mean that Google recommends using its customized HBase client APIs to access data in Cloud Bigtable?
Is it possible to connect to a Cloud Bigtable cluster from my own Hadoop cluster? Mine runs in AWS rather than Google Cloud.
Upvotes: 1
Views: 104
Reputation: 2683
Right now the best Java client, especially for Hadoop MapReduce, is the HBase API, yes. That said, the client targets the standard HBase 1.0+ Connection, Table, and Admin interfaces, so your code isn't tied to anything Bigtable-specific.
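For instance, a read through those interfaces looks the same whether the backend is HBase or Cloud Bigtable. A minimal sketch (the table name, row key, and column names here are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BigtableReadExample {
  public static void main(String[] args) throws Exception {
    // Plain HBase configuration; the bigtable-hbase client is wired in
    // behind these interfaces via configuration (see the auth sketch below).
    Configuration config = HBaseConfiguration.create();

    try (Connection connection = ConnectionFactory.createConnection(config);
         Table table = connection.getTable(TableName.valueOf("my-table"))) {
      // Fetch a single row -- identical code against HBase or Bigtable.
      Result result = table.get(new Get(Bytes.toBytes("row-key-1")));
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qualifier"));
      System.out.println(Bytes.toString(value));
    }
  }
}
```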
To connect from AWS, follow the steps in Connecting to Cloud Bigtable and set extra configuration values that specify how to authenticate to Cloud Bigtable. At a minimum, you'll want to specify the email address of a service account (google.bigtable.auth.service.account.email) and the location of a p12 key file for that service account (google.bigtable.auth.service.account.keyfile); both keys are defined in BigtableOptionsFactory. The p12 key file and the bigtable-hbase JARs need to be distributed along with the job (or deployed to the cluster ahead of time), so however you deploy other jobs to your cluster will determine how these extra dependencies get there.
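As a rough sketch, those values can be set on the job's Configuration before the connection is created. The two service-account keys are the ones named above; the project/zone/cluster keys and the connection implementation class are my assumptions about what the linked instructions require, so double-check them against the docs:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BigtableAuthConfig {
  public static Configuration create() {
    Configuration config = HBaseConfiguration.create();

    // Service account credentials (keys as named in BigtableOptionsFactory).
    config.set("google.bigtable.auth.service.account.email",
        "my-service-account@my-project.iam.gserviceaccount.com"); // hypothetical account
    config.set("google.bigtable.auth.service.account.keyfile",
        "/path/to/keyfile.p12"); // must be readable on every task node

    // Assumption: cluster coordinates and the Bigtable-backed Connection
    // implementation are also required, per the linked setup guide.
    config.set("google.bigtable.project.id", "my-project");
    config.set("google.bigtable.zone.name", "us-central1-b");
    config.set("google.bigtable.cluster.name", "my-cluster");
    config.set("hbase.client.connection.impl",
        "com.google.cloud.bigtable.hbase1_1.BigtableConnection");

    return config;
  }
}
```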
As for deploying, if you're using Maven (or Ivy, etc.), the bigtable-hbase JARs can be found on Maven Central under the group com.google.cloud.bigtable; there are artifacts for both HBase 1.0 and 1.1. If you're using the maven-assembly or maven-shade plugins, you can bundle the required artifacts with your jobs.
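For example, a dependency along these lines, combined with maven-shade, gets the client bundled into the job JAR (the artifact id is my guess at the HBase 1.1 flavor, and the version is a placeholder; check Maven Central for the exact coordinates and latest release):

```xml
<dependency>
  <groupId>com.google.cloud.bigtable</groupId>
  <artifactId>bigtable-hbase-1.1</artifactId>  <!-- or the HBase 1.0 artifact -->
  <version>0.2.2</version>  <!-- hypothetical; use the latest from Maven Central -->
</dependency>
```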
Also note that when running outside GCP/GCE, latency to and from Cloud Bigtable will be higher due to the extra network hops.
Upvotes: 2