yoganandh
yoganandh

Reputation: 277

How to crawl and parse only precise data using Nutch?

I'm new to Nutch and crawling. I have installed Nutch 2.0, crawled and indexed the data using Solr 4.5 by following some basic tutorials. Now I don't want to parse all the text content of a page, I want to customize it like Nutch should crawl the page and scrape/fetch only the data related to address because my use case is to crawl URLs and parse only address info as text.

For example, I need to crawl and parse only the text content which has address information, email id, phone number and fax number.

  1. How should I do this? Is there any plugin already available for this?
  2. If I want to write a customized parser for this can anyone help me in this regards?

Upvotes: 2

Views: 1652

Answers (2)

Kaidul
Kaidul

Reputation: 15875

Check [NUTCH-978] which introduces a plugin called XPath which allows the user of nutch to process various web page and get only certain information that user wants therefore making the index more accurate and its content more flexible.

Upvotes: 1

Jorge Luis
Jorge Luis

Reputation: 3253

Checkout NUTCH-1870 a work in progress on a generic XPath plugin for Nutch, the alternative is to write a custom HtmlParseFilter that scrap the data that you want. A good (and simple) example is the headings plugin. Keep in mind that both of this links are for the 1.x branch of Nutch, and you're working with the 2.x although things are different in some degree the logic should be portable, the other alternative is using the 1.x branch.

Based on your comment:

Since you don't know the structure of the webpage, the problem is somehow different: Essentially you'll need to "teach" Nutch how to detect the text you want, based on some regexp or using some library that does address extraction out of plain text like jgeocoder library, you'll need to parse (iterate on every node of the webpage) trying to find something that resembles an address, phone number, fax number, etc. This is kind of similar to what the headings plugin does, but instead of looking for addresses or phone numbers it just finds the title nodes in the HTML structure. This could be a starting point to write some plugin that does what you want, but I don't think there is anything out of the box for this do.

Upvotes: 2

Related Questions