Reputation: 869
I'm looking for a framework that I can use for the following scenario: I have two web services. I call the first service, which returns a JSON response. That response contains some IDs, which I use to call the other service; I then merge the responses and store the result in a database. I want to call these services every day to keep the database up to date.
What I found is Nutch, but it looks like a web crawler aimed mostly at HTML pages. Is there a framework I can use for the scenario above? I'm looking for a fault-tolerant, scalable Java framework.
Thanks!
Upvotes: 0
Views: 101
Reputation: 4854
You could use Nutch; it is not limited to HTML. If something can be accessed via a URL, Nutch can fetch it. However, you might need to implement some custom parsers and indexers to deal with your content.
Alternatively, storm-crawler (SC) would be both scalable and customisable. You might find it easier to learn than Nutch, and more flexible. In your use case you could have one or more queues (e.g. RabbitMQ, AWS SQS, etc.) in front of SC. The seed URLs would be the ones for the first service, and you could have custom parse filters that generate the URLs for the second one. Finally, you'd have a bespoke indexing bolt that sends the merged data to the DB. There are plenty of resources available for Storm that you could piggyback on.
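To make the parse-filter step more concrete, here is a minimal plain-Java sketch of the logic such a filter would carry out: read the first service's JSON and turn each ID into a URL for the second service. This is not storm-crawler's API. The class name `OutlinkSketch`, the `"id"` field name, the numeric IDs and the `https://example.org/details/` base URL are all assumptions for illustration, and a real filter would use a proper JSON parser instead of a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutlinkSketch {

    // Assumption: the first service exposes numeric ids in fields named "id".
    private static final Pattern ID_FIELD = Pattern.compile("\"id\"\\s*:\\s*(\\d+)");

    // Builds one second-service URL per id found in the first response.
    static List<String> secondServiceUrls(String firstResponseJson, String detailBaseUrl) {
        List<String> urls = new ArrayList<>();
        Matcher m = ID_FIELD.matcher(firstResponseJson);
        while (m.find()) {
            urls.add(detailBaseUrl + m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical response from the first service.
        String response = "{\"items\":[{\"id\":17},{\"id\":42}]}";
        for (String url : secondServiceUrls(response, "https://example.org/details/")) {
            System.out.println(url);
        }
    }
}
```

In a crawler setup, the URLs returned here would be emitted as outlinks so the second service is fetched in the next step, and the indexing bolt would merge and persist the results.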
HTH
Upvotes: 1