Reputation: 1680
I am trying to crawl some sites that have a poorly maintained HTML structure, and I have no control over them to change it. When I look at the Nutch-crawled data indexed by Solr, the 'title' field looks okay, whereas the 'content' field includes a lot of junk: it grabbed all the text from the HTML banner with its drop-down menu and worked down through the left side menu, navigation, footer, etc.
In my case, I am only interested in grabbing the "Description:" information, which is defined in a paragraph on the HTML page, into the 'content' field.
Example: (raw html):
<p><strong>Description:</strong> Apache Nutch is an open source Web crawler written in Java. By using it, we can find Web page hyperlinks in an automated manner, reduce lots of maintenance work, for example checking broken links, and create a copy of all the visited pages for searching over.</p>
How can I filter the junk out of the 'content' field and only have the information I am interested in?
Upvotes: 0
Views: 941
Reputation: 1708
You can use the plugin below to extract content based on XPath queries. If the content you want lives in a specific div or paragraph, this plugin lets you pull out just that section instead of the whole page body.
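To illustrate the idea, here is a minimal sketch of the kind of XPath extraction such a plugin performs, using only the JDK's built-in `javax.xml.xpath` API. The class name `DescriptionExtractor` and the XPath expression are my own illustrative choices, not part of any Nutch plugin; note also that a real plugin would need an HTML-tolerant parser, since `DocumentBuilder` only accepts well-formed markup:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class DescriptionExtractor {

    // Hypothetical XPath: select the <p> whose <strong> label reads "Description:"
    static final String DESCRIPTION_XPATH =
            "//p[strong[normalize-space(text())='Description:']]";

    public static String extract(String html) throws Exception {
        // Parse the markup (must be well-formed XML for DocumentBuilder)
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));

        // Evaluate the XPath; with a String return type this yields the
        // text content of the first matching node
        XPath xp = XPathFactory.newInstance().newXPath();
        String text = xp.evaluate(DESCRIPTION_XPATH, doc);

        // Strip the leading "Description:" label so only the value remains
        return text.replaceFirst("^\\s*Description:\\s*", "").trim();
    }

    public static void main(String[] args) throws Exception {
        // Banner, menu, and footer text are ignored; only the tagged
        // description paragraph is extracted
        String html = "<html><body>"
                + "<div id=\"menu\">Home | About | Contact</div>"
                + "<p><strong>Description:</strong> Apache Nutch is an open source"
                + " Web crawler written in Java.</p>"
                + "<div id=\"footer\">Copyright</div>"
                + "</body></html>";
        System.out.println(extract(html));
    }
}
```

Running this prints only the description sentence, with the navigation and footer text dropped, which is the same filtering effect you would configure the plugin to perform on the 'content' field.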
Upvotes: 1