Reputation: 1680
I am trying to crawl some sites that have a poorly maintained HTML structure, and I have no control over them to change it. When I look at the Nutch-crawled data indexed by Solr, the 'title' field looks okay, whereas the 'content' field includes a lot of junk: it grabbed all the text from the HTML banner with its drop-down menu and worked down through the left side menu, navigation, footer, etc.
In my case, I am only interested in grabbing the "Description:" information, which is defined in a paragraph on the HTML page, into the 'content' field.
Example: (raw html):
<p><strong>Description:</strong> Apache Nutch is an open source Web crawler written in Java. By using it, we can find Web page hyperlinks in an automated manner, reduce lots of maintenance work, for example checking broken links, and create a copy of all the visited pages for searching over.</p>
How can I filter the junk out of the 'content' field and only have the information I am interested in?
Upvotes: 0
Views: 941
Reputation: 1708
You can use the plugin below to extract content based on XPath queries. If the content you want lives in a specific div or paragraph, this plugin lets you pull out just that section instead of the whole page body.
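To illustrate the idea, here is a minimal sketch of the kind of XPath extraction such a plugin performs, using only the JDK's built-in `javax.xml.xpath` API. The class name `DescriptionExtractor` and the XPath expression are my own illustrative choices, not part of any Nutch plugin; note also that a real plugin would need an HTML-tolerant parser, since `DocumentBuilder` only accepts well-formed markup:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class DescriptionExtractor {

    // Hypothetical XPath: select the <p> whose <strong> label reads "Description:"
    static final String DESCRIPTION_XPATH =
            "//p[strong[normalize-space(text())='Description:']]";

    public static String extract(String html) throws Exception {
        // Parse the markup (must be well-formed XML for DocumentBuilder)
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));

        // Evaluate the XPath; with a String return type this yields the
        // text content of the first matching node
        XPath xp = XPathFactory.newInstance().newXPath();
        String text = xp.evaluate(DESCRIPTION_XPATH, doc);

        // Strip the leading "Description:" label so only the value remains
        return text.replaceFirst("^\\s*Description:\\s*", "").trim();
    }

    public static void main(String[] args) throws Exception {
        // Banner, menu, and footer text are ignored; only the tagged
        // description paragraph is extracted
        String html = "<html><body>"
                + "<div id=\"menu\">Home | About | Contact</div>"
                + "<p><strong>Description:</strong> Apache Nutch is an open source"
                + " Web crawler written in Java.</p>"
                + "<div id=\"footer\">Copyright</div>"
                + "</body></html>";
        System.out.println(extract(html));
    }
}
```

Running this prints only the description sentence, with the navigation and footer text dropped, which is the same filtering effect you would configure the plugin to perform on the 'content' field.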
Upvotes: 1