Revan

Reputation: 551

How do I read and write data in Google Cloud Bigtable from a PySpark application?

I am using Spark on a Google Cloud Dataproc cluster and would like to access Bigtable from a PySpark job. Is there a Bigtable connector for Spark, similar to the Google BigQuery connector?

How can we access Bigtable from a PySpark application?

Upvotes: 5

Views: 4726

Answers (1)

Patrick Clay

Reputation: 1349

Cloud Bigtable is usually best accessed from Spark using the Apache HBase APIs.

HBase currently only provides Hadoop MapReduce I/O formats. These can be accessed from Spark (or PySpark) via the SparkContext.newAPIHadoopRDD method. However, converting the records into something usable in Python is difficult.
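For illustration, here is a minimal sketch of that approach. It assumes the HBase (or bigtable-hbase adapter) client jars and Spark's examples jar (which provides the Python converters) are on the classpath; the table name is a placeholder, and the Bigtable connection settings that would normally go into the configuration are omitted.

    # Hedged sketch: read via HBase's TableInputFormat from PySpark.
    # Assumes HBase/bigtable-hbase jars and Spark's examples jar are on
    # the classpath; "my-table" is a placeholder.
    from pyspark import SparkContext

    sc = SparkContext()

    conf = {"hbase.mapreduce.inputtable": "my-table"}

    rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
        conf=conf,
    )
    print(rdd.take(1))

Even then, each HBase Result arrives in Python as a string rendering of the row, which is why the conversion step is the awkward part.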

HBase is developing Spark SQL APIs, but these have not yet been integrated into a released version. Hortonworks has a Spark HBase Connector, but it compiles against Spark 1.6 (which requires Cloud Dataproc version 1.0), and I have not used it, so I cannot speak to how easy it is to use.
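For reference, the connector's documented pattern is to describe the table with a JSON catalog and load it through Spark SQL. The sketch below is untested; the data source class name and the catalog fields are assumptions based on the connector's documentation.

    # Untested sketch of the Hortonworks Spark HBase Connector from
    # PySpark (Spark 1.6 style). Catalog fields are placeholders.
    import json
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext()
    sqlContext = SQLContext(sc)

    # Hypothetical catalog mapping HBase cells to DataFrame columns.
    catalog = json.dumps({
        "table": {"namespace": "default", "name": "my-table"},
        "rowkey": "key",
        "columns": {
            "key": {"cf": "rowkey", "col": "key", "type": "string"},
            "value": {"cf": "cf1", "col": "value", "type": "string"},
        },
    })

    df = (sqlContext.read
          .options(catalog=catalog)
          .format("org.apache.spark.sql.execution.datasources.hbase")
          .load())
    df.show()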

Alternatively, you could use a Python-based Bigtable client, such as google-cloud-bigtable, and simply use PySpark for the parallelism, as sketched below.
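For example, you can open a client per partition and write rows in parallel. This is a minimal sketch; the project, instance, table, and column-family IDs are placeholders, and it assumes an existing SparkContext named sc.

    # Minimal sketch: write to Bigtable with the Python client, using
    # PySpark only for parallelism. Identifiers are placeholders.
    from google.cloud import bigtable

    PROJECT, INSTANCE, TABLE = "my-project", "my-instance", "my-table"

    def write_partition(records):
        # The client is not picklable, so create it on the executor
        # (once per partition), not on the driver.
        client = bigtable.Client(project=PROJECT)
        table = client.instance(INSTANCE).table(TABLE)
        for key, value in records:
            row = table.direct_row(key.encode("utf-8"))
            row.set_cell("cf1", b"col", value.encode("utf-8"))
            row.commit()

    rdd = sc.parallelize([("row-1", "v1"), ("row-2", "v2")])
    rdd.foreachPartition(write_partition)

Reads follow the same pattern: distribute row-key ranges across partitions and call table.read_rows within each one.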

Upvotes: 6
