Reputation: 631
I'm using a 3rd-party service which aggregates data and exposes a REST API for accessing it.
I'm now trying to fetch that data and load it into our local HBase cluster. I created a Java application that fetches data from the 3rd-party service, processes it and loads it into our cluster using the HBase client API. I have to run this application manually, and I'm also not sure how efficient the HBase client API is for loading bulk data.
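For reference, loading via the HBase client API looks roughly like this (a simplified sketch; the table, column family and values are made up, not my actual schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientApiLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table");   // example table name
        try {
            // One Put per fetched record; every put goes through the
            // region servers' normal write path.
            Put put = new Put(Bytes.toBytes("row-1"));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes("..."));
            table.put(put);
        } finally {
            table.close();
        }
    }
}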
I came across Sqoop and Cascading-dbmigrate for doing bulk transfers from an RDBMS. My question is: are there any similar tools for doing bulk data transfer from REST APIs, and also for syncing the data at regular intervals?
Thanks, ArunDhaJ http://arundhaj.com
Upvotes: 0
Views: 1278
Reputation: 41428
REST APIs are not standardized the way RDBMSs are, so to my knowledge there is no tool that can magically load from your API into HBase; you have to build a little something around it. For this kind of heavy loading into HBase, a good practice is to use HBase bulk load, which uses less CPU and network resources than simply going through the HBase API. This can be done in a few steps:
Prepare the data with a Map/Reduce job using HFileOutputFormat as the OutputFormat. This ensures that your job output is written as HFiles, which is a very efficient format to load into HBase. You could do it like this:
job.setOutputFormatClass(HFileOutputFormat.class);
HFileOutputFormat.setOutputPath(job, path);
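If it helps, here is a rough sketch of what the full driver and mapper could look like (the mapper logic, paths and table name are placeholders for whatever your job actually uses; it assumes the API data was already dumped to HDFS as tab-separated "rowkey<TAB>value" lines):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class BulkLoadDriver {

    // Placeholder mapper: assumes each input line is "rowkey<TAB>value".
    public static class ApiRecordMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            byte[] row = Bytes.toBytes(parts[0]);
            Put put = new Put(row);
            put.add(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes(parts[1]));
            ctx.write(new ImmutableBytesWritable(row), put);
        }
    }

    // usage: hadoop jar myjob.jar BulkLoadDriver <input dir> <hfile output dir> <table>
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "prepare-hfiles");
        job.setJarByClass(BulkLoadDriver.class);

        job.setMapperClass(ApiRecordMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        // Input: the raw data fetched from the REST API, already staged on HDFS.
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // Output: HFiles ready to be picked up by completebulkload.
        job.setOutputFormatClass(HFileOutputFormat.class);
        HFileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Configures the partitioner, reducer and compression so the HFiles
        // line up with the table's existing region boundaries.
        HTable table = new HTable(conf, args[2]);
        HFileOutputFormat.configureIncrementalLoad(job, table);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}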
Load the data using the command line tool completebulkload, which takes care of everything so you don't even need to worry about the region servers. This can be done manually like this:
hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable
I believe this step is run automatically if you use HFileOutputFormat, so you may not even need to do this step yourself.
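If you would rather trigger the load from your Java application instead of the command line, the same step can be done with the LoadIncrementalHFiles class. A rough sketch, reusing the output path and table name from the command above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Moves the HFiles produced by the M/R job into the table's regions.
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(new Path("/user/todd/myoutput"), new HTable(conf, "mytable"));
    }
}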
More details on the process can be found in the HBase bulk load documentation.
What you need to do to tie everything together is simply write a program that will fetch data from your API and load into HDFS.
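As a starting point, that program could be as simple as streaming the API response into an HDFS file, something like the sketch below (the URL and HDFS path are made up; a real job would add paging, error handling and whatever scheduling you need for the regular sync):

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ApiToHdfs {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; replace with the 3rd-party service's URL.
        URL api = new URL("https://api.example.com/records?format=json");

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/me/api-dump/records.json");

        try (InputStream in = api.openStream();
             FSDataOutputStream os = fs.create(out, true)) {
            // Stream the API response straight into HDFS.
            IOUtils.copyBytes(in, os, 4096, false);
        }
    }
}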
Upvotes: 3