cipri.l

Reputation: 819

Which is the best HBase connector to use for batch loading data into HBase from Spark?

As mentioned in Which HBase connector for Spark 2.0 should I use?, there are mainly two options: an RDD-based connector and a DataFrame-based one.

I do understand the optimizations and the differences with regard to READING from HBase.

However, it's not clear to me which one I should use for BATCH inserting into HBase.

I am not interested in inserting records one by one; I care about high throughput.

After digging through the code, it seems that both resort to TableOutputFormat: http://hbase.apache.org/1.2/book.html#arch.bulk.load
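As background for the bulk-load link above: HFile-based bulk loading requires cells to arrive sorted by rowkey in HBase's unsigned lexicographic byte order (what `Bytes.compareTo` implements). A minimal sketch of that comparison, assuming no HBase on the classpath:

```scala
// Bulk load writes HFiles directly, so cells must be sorted by rowkey using
// HBase's unsigned lexicographic byte order. This re-implements that
// comparison for illustration (hypothetical helper, not an HBase API).
def compareRowKeys(a: Array[Byte], b: Array[Byte]): Int = {
  val len = math.min(a.length, b.length)
  var i = 0
  while (i < len) {
    // Compare each byte as an unsigned value, as HBase does.
    val cmp = (a(i) & 0xff) - (b(i) & 0xff)
    if (cmp != 0) return cmp
    i += 1
  }
  // Shorter key sorts first when one is a prefix of the other.
  a.length - b.length
}

val keys = Seq("row-10", "row-2", "row-1").map(_.getBytes("UTF-8"))
val sorted = keys
  .sortWith((x, y) => compareRowKeys(x, y) < 0)
  .map(new String(_, "UTF-8"))
// Note the byte-wise (not numeric) order: "row-1", "row-10", "row-2"
```

This is why numeric rowkeys are usually zero-padded or stored as fixed-width binary when bulk loading.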

The project uses Scala 2.11, Spark 2, and HBase 1.2.

Does the DataFrame library provide any performance improvements over the RDD library specifically for BULK LOAD?

Upvotes: 1

Views: 2056

Answers (2)

Naveen Nelamali

Reputation: 1164

Lately, the hbase-spark connector has been released to Maven Central as version 1.0.0; it supports Spark 2.4.0 and Scala 2.11.12:

  <dependency>
    <groupId>org.apache.hbase.connectors.spark</groupId>
    <artifactId>hbase-spark</artifactId>
    <version>1.0.0</version>
  </dependency>

It supports both RDDs and DataFrames. Please refer to spark-hbase-connectors for more details.
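For the DataFrame path, a batch write through this connector looks roughly like the sketch below. The table name, column mapping, and sample data are assumptions for illustration; the options shown (`hbase.table`, `hbase.columns.mapping`) are the ones documented for the connector's SQL data source, and the job needs an HBase-configured cluster to actually run:

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: batch-write a DataFrame to HBase via the hbase-spark 1.0.0
// connector's data source. "my_table", the "cf" column family, and the data
// are made up for this example.
val spark = SparkSession.builder().appName("hbase-batch-write").getOrCreate()
import spark.implicits._

val df = Seq(("row1", "alice"), ("row2", "bob")).toDF("key", "name")

df.write
  .format("org.apache.hadoop.hbase.spark")
  // Map DataFrame columns to the rowkey (:key) and family:qualifier cells.
  .option("hbase.columns.mapping", "key STRING :key, name STRING cf:name")
  .option("hbase.table", "my_table")
  .save()
```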

Happy Learning !!

Upvotes: 4

Sachin Thapa

Reputation: 3719

Have you looked at the bulk load examples in the HBase project?

See HBase Bulk Examples; the GitHub page has Java examples, and you can easily write the equivalent Scala code.

Also read Apache Spark Comes to Apache HBase with HBase-Spark Module
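The HBase-Spark module described in that post exposes a bulk-load helper on `HBaseContext` that writes HFiles from an RDD instead of going through the write path. A hedged sketch, where the table name, staging directory, and data are assumptions and a configured cluster is required:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.spark.{HBaseContext, KeyFamilyQualifier}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

// Hedged sketch of HBaseContext.bulkLoad from the HBase-Spark module:
// emit (rowkey, family, qualifier) -> value cells per record, write them as
// HFiles into a staging directory, then complete the bulk load.
val spark = SparkSession.builder().appName("hbase-bulkload").getOrCreate()
val sc = spark.sparkContext
val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

val rdd = sc.parallelize(Seq(("row1", "alice"), ("row2", "bob")))

hbaseContext.bulkLoad[(String, String)](
  rdd,
  TableName.valueOf("my_table"),
  { case (key, name) =>
    // One cell per record: rowkey -> cf:name = value.
    val kfq = new KeyFamilyQualifier(
      Bytes.toBytes(key), Bytes.toBytes("cf"), Bytes.toBytes("name"))
    Iterator((kfq, Bytes.toBytes(name)))
  },
  "/tmp/hbase-staging")
// Afterwards, the completebulkload tool (LoadIncrementalHFiles) moves the
// generated HFiles into the table's regions.
```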

Given a choice between RDDs and DataFrames, we should use DataFrames, per the recommendation in the official documentation:

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.

Hoping this helps.

Cheers !

Upvotes: 2
