Bilbo Baggins

Reputation: 3019

Modifying the Nutch crawler to parse the page and get certain data from the pages crawled

I want to crawl several sites and collect data based on the language, e.g. "Java". I am new to the Nutch crawler and have just finished setting up Nutch 2.3 with HBase. How can I customize the crawl so that, when each page is parsed, I can get the links within that page and extract some data from it, such as the date and the topic?

Thank you.

Upvotes: 3

Views: 973

Answers (1)

Jakub Janoštík

Reputation: 186

This is probably late, but for anyone facing the same issue: it is solved by providing your own ParseFilter plugin.

You can read about plugins in the Nutch plugin documentation.

Basically, you implement the filter method of the ParseFilter interface, which receives a DocumentFragment object as one of its arguments. From the DocumentFragment you can then extract whatever information you need using XPath. The extracted data can be saved in the WebPage metadata.
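
To make the XPath part concrete, here is a minimal sketch that uses only the JDK, so it runs without Nutch on the classpath. The class name PageDataExtractor, its method names, and the sample markup are all hypothetical; inside a real plugin you would call the same kind of helpers on the DocumentFragment that Nutch hands to your filter and then store the results in the WebPage metadata. Note that the HTML parser used by Nutch may normalize element names (for instance to upper case), so check the actual DOM before settling on your XPath expressions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class: not part of Nutch, only illustrates XPath on a DocumentFragment.
final class PageDataExtractor {

    // Returns the trimmed text of the first node matching the expression, or "" if nothing matches.
    static String extractText(DocumentFragment doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    // Returns the href value of every <a> element below the fragment, in document order.
    static List<String> extractLinks(DocumentFragment doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs = (NodeList) xpath.evaluate(".//a/@href", doc, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                links.add(hrefs.item(i).getNodeValue());
            }
            return links;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Demo helper (assumption: input is well-formed XML/XHTML). In Nutch the fragment
    // is produced by the HTML parser and passed to the plugin, so this is not needed there.
    static DocumentFragment toFragment(String xhtml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            DocumentFragment fragment = document.createDocumentFragment();
            // Moves the root element out of the document and into the fragment.
            fragment.appendChild(document.getDocumentElement());
            return fragment;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        DocumentFragment doc = toFragment(
                "<html><body><h1>Java</h1><span class='date'>2015-03-01</span>"
                + "<a href='http://example.com/a'>A</a><a href='/b'>B</a></body></html>");
        System.out.println("topic = " + extractText(doc, ".//h1"));
        System.out.println("date  = " + extractText(doc, ".//span[@class='date']"));
        System.out.println("links = " + extractLinks(doc));
    }
}
```

The expressions are relative (they start with `.//`), so they are evaluated against the fragment itself rather than relying on how the XPath engine resolves the document root for a fragment.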

After you implement the plugin, you just have to add it to the Nutch source tree, enable it in nutch-site.xml (via the plugin.includes property), rebuild, and you are good to go.
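
As a rough illustration of the nutch-site.xml step, the fragment below overrides plugin.includes so that it also matches the new plugin. The plugin id `myparsefilter` is hypothetical, and the rest of the value is only an example: copy the existing list from the nutch-default.xml of your own installation and append your plugin's id to it.

```xml
<property>
  <name>plugin.includes</name>
  <!-- Example value only: start from your installation's default list,
       then append the id of your own plugin (here the hypothetical "myparsefilter"). -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|myparsefilter</value>
</property>
```

The value is a regular expression matched against plugin ids, so a plugin that is not matched by it will not be loaded.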

Upvotes: 1
