ArunDhaJ

Reputation: 631

Load data from API to HBase

I'm using a third-party service that aggregates data and exposes a REST API for accessing it.

I'm now trying to fetch that data and load it into our local HBase cluster. I wrote a Java application that fetches data from the third-party service, processes it, and loads it into our cluster using the HBase client API. I have to run this application manually, and I'm not sure how efficient the HBase client API is for loading bulk data.

I came across Sqoop and Cascading-dbmigrate for bulk transfer from an RDBMS. My question is: is there a similar tool for bulk data transfer from REST APIs, and for syncing the data at regular intervals?

Thanks ArunDhaJ http://arundhaj.com

Upvotes: 0

Views: 1278

Answers (1)

Charles Menguy

Reputation: 41428

REST APIs are not standardized the way RDBMSs are, so to my knowledge there is no tool that can magically load from your API into HBase; you have to build a little something around it. For this kind of heavy loading into HBase, a good practice is to use HBase bulk load, which uses less CPU and network resources than simply using the HBase API. This can be done in a few steps:

  1. Prepare the data with a Map/Reduce job using HFileOutputFormat as the OutputFormat. This ensures that your job output is written as HFiles, which is a very efficient format to load into HBase. You could do it like this:

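    // Write the job output as HFiles instead of plain text.
    job.setOutputFormatClass(HFileOutputFormat.class);
    // HDFS directory that receives the HFiles (the input to step 2).
    HFileOutputFormat.setOutputPath(job, path);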
    
  2. Load the data using the command line tool completebulkload, which takes care of everything so you don't even need to worry about the region servers. This can be done manually like this:

    hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable
    

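    Note that HFileOutputFormat only writes the HFiles; it does not load them into the table, so this step still has to be run after the job finishes, either from the command line as above or from your own code via the LoadIncrementalHFiles class that backs completebulkload.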

More details on the process here

To tie everything together, you simply need to write a program that fetches data from your API and loads it into HDFS.
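
As a rough illustration of that fetching program, here is a minimal sketch using only the JDK (Java 11+ for `java.net.http`). Everything specific in it is an assumption made for the example: the class name `ApiToStaging`, the one-record-per-line "rowKey TAB payload" layout, the use of the URL as the row key, and the idea of staging to a local file first. Your API's paging, authentication and response parsing will differ.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical staging step: pull one response from the REST API and append it,
// as a "rowKey<TAB>payload" line, to a local file that is later copied to HDFS.
class ApiToStaging {

    // Flatten a record to a single TSV line; tabs and line breaks inside the
    // payload are replaced by spaces so one record always stays on one line.
    static String toTsvLine(String rowKey, String payload) {
        return rowKey + "\t" + payload.replaceAll("[\\t\\r\\n]", " ");
    }

    // Fetch the raw response body for one API URL (the URL is an assumption).
    static String fetch(HttpClient client, String url)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    // Append lines to the staging file, creating it on first use.
    static void appendLines(Path stagingFile, List<String> lines) throws IOException {
        Files.write(stagingFile, lines, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws Exception {
        String url = args[0];                 // one endpoint of the third-party API
        Path stagingFile = Path.of(args[1]);  // local file to copy into HDFS afterwards
        String body = fetch(HttpClient.newHttpClient(), url);
        // The URL doubles as the row key here purely for illustration.
        appendLines(stagingFile, List.of(toTsvLine(url, body)));
    }
}
```

The staged file can then be copied into HDFS with `hadoop fs -put` and used as the input of the Map/Reduce job from step 1. Writing straight to HDFS through the Hadoop `FileSystem` API would avoid the local copy, at the cost of a Hadoop dependency in the fetcher.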

Upvotes: 3
