Reputation: 1572
I am implementing a notification system based on the publish-subscribe model to announce the availability of data as it arrives in (or is loaded into) HDFS. I couldn't find where to start looking. Is there an HDFS API that can be used for this, or what method should I use to find out when new data has been written to HDFS? I am using Hadoop v2.0.2, and I don't want to use HCatalog; I want to implement my own tool for this.
Upvotes: 6
Views: 3191
Reputation: 48003
What you are looking for is Oozie Coordinator.
HDFS is a file system, so something must be built on top of HDFS to check for file availability. HBase has coprocessors, which are triggered procedures, but they are only available for HBase tables, so they cannot be used to detect data availability in HDFS.
Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. You can also execute other programs from it:
Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).
So you can use the file availability trigger for your notification system too.
Upvotes: 3
Reputation: 7362
If you use HDFS, you might want to check out HBase, as it has the functionality you want. In HBase you can create a pre-put (or post-put) coprocessor, which acts essentially like a MySQL trigger: it runs a bit of code every time data is written to a table.
If HBase doesn't suit your use case and you must use HDFS, AFAIK there aren't similar triggers. You can try wrapping the HDFS API with your own code to perform the notification whenever data is written to your file system under the appropriate circumstances. Alternatively, you can poll HDFS for changes (which sounds like an ugly alternative)...
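To make the polling option concrete, here is a minimal sketch in Python of a poll-and-notify loop with a publish-subscribe interface. Everything named in it is hypothetical: `PollingNotifier`, `subscribe`, `poll_once`, and in particular the `list_paths` callable, which is assumed to return the paths currently present under the watched HDFS directory (for example by parsing `hdfs dfs -ls` output or calling WebHDFS `LISTSTATUS`; that part is deliberately left out). It only illustrates the snapshot-and-diff idea and is not a tested HDFS client.

```python
from typing import Callable, Iterable, List, Set

# Hypothetical type aliases for this sketch.
Listing = Callable[[], Iterable[str]]   # returns the paths currently in the watched directory
Subscriber = Callable[[str], None]      # called once for each newly seen path


class PollingNotifier:
    """Detects new paths by diffing successive directory listings."""

    def __init__(self, list_paths: Listing) -> None:
        self._list_paths = list_paths
        # Baseline snapshot: paths that already exist are not reported.
        self._seen: Set[str] = set(list_paths())
        self._subscribers: List[Subscriber] = []

    def subscribe(self, callback: Subscriber) -> None:
        """Register a callback to be invoked for every new path."""
        self._subscribers.append(callback)

    def poll_once(self) -> List[str]:
        """Take a new listing, notify subscribers of new paths, return them."""
        current = set(self._list_paths())
        new_paths = sorted(current - self._seen)
        self._seen = current
        for path in new_paths:
            for callback in self._subscribers:
                callback(path)
        return new_paths


if __name__ == "__main__":
    # Stand-in for an HDFS listing so the sketch runs without a cluster.
    fake_dir = ["/data/incoming/part-0000"]
    notifier = PollingNotifier(lambda: list(fake_dir))
    notifier.subscribe(lambda p: print("new data:", p))

    fake_dir.append("/data/incoming/part-0001")
    notifier.poll_once()
```

In a real tool you would call `poll_once()` on a timer. Note that a file can show up in a listing while it is still being written, so you would probably want the writer to signal completion, for example with a marker file or by renaming the file into the watched directory once it is closed.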
Hope that helps
Upvotes: 1