Reputation: 7951
We are planning to use REST API calls to ingest data from an endpoint and store the data to HDFS. The REST calls are done in a periodic fashion (daily or maybe hourly).
I've already done Twitter ingestion using Flume, but I don't think using Flume would suit my current use-case because I am not using a continuous data firehose like this one in Twitter, but rather discrete regular time-bound invocations.
The idea I have right now, is to use custom Java that takes care of REST API calls and saves to HDFS, and then use Oozie coordinator on that Java jar.
I would like to hear suggestions / alternatives (if there's easier than what I'm thinking right now) about design and which Hadoop-based component(s) to use for this use-case. If you feel I can stick to Flume, then kindly give me also an idea how to do this.
Upvotes: 3
Views: 1177
Reputation: 3798
As stated in the Apache Flume web:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
As you can see, among the features attributed to Flume is the gathering of data. "Pushing-like or emitting-like" data sources are easy to integrate thanks to HttpSource
, AvroSurce
, ThriftSource
, etc. In your case, where the data must be let's say "actively pulled" from a http-based service, the integration is not so obvious, but can be done. For instance, by using the ExecSource
, which runs a script getting the data and pushing it to the Flume agent.
If you use a proprietary code in charge of pulling the data and writting it into HDFS, such a design will be OK, but you will be missing some interesting built-in Flume characteristics (that probably you will have to implement by yourself):
Upvotes: 3