Reputation: 11
I need to index the fetched content crawled by Nutch into Solr. The Solr indexing job in Nutch indexes only the parsed content, but I need the content with all HTML tags. Can anyone guide me on this?
Thanks Sudh
Upvotes: 1
Views: 1658
Reputation: 1708
Nutch has a series of parsers and filters that will extract content from the fetched HTML.
You need to implement an HtmlParseFilter, write the raw content into a metadata tag, and map it to a SOLR field.
The tutorial below is about an indexing filter but it follows the same flow.
Your class should implement "HtmlParseFilter" instead of "IndexingFilter". Override the filter() method:
@Override
public ParseResult filter(Content content, ParseResult parseResult,
                          HTMLMetaTags metaTags, DocumentFragment doc) {
  // Get the parse metadata for this URL so the raw HTML can be attached to it.
  Metadata metadata = parseResult.get(content.getUrl()).getData().getParseMeta();
  // content.getContent() returns the raw fetched bytes, i.e. the unparsed HTML.
  byte[] rawContent = content.getContent();
  // StandardCharsets.UTF_8 (java.nio.charset) avoids the checked
  // UnsupportedEncodingException that new String(bytes, "UTF-8") would throw here.
  String str = new String(rawContent, StandardCharsets.UTF_8);
  metadata.add("rawcontent", str);
  return parseResult;
}
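If you package the filter as its own plugin, the extension point also has to be declared in that plugin's plugin.xml. A minimal sketch follows; the plugin id, jar name, and class name (parse-rawcontent, RawContentParseFilter) are placeholders you would replace with your own:

<plugin id="parse-rawcontent" name="Raw Content Parse Filter"
        version="1.0.0" provider-name="example.org">
  <runtime>
    <library name="parse-rawcontent.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.example.parse.rawcontent" name="Raw Content Parse Filter"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="RawContentParseFilter"
                    class="org.example.parse.RawContentParseFilter"/>
  </extension>
</plugin>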
After that, change your schema.xml and add the new field:
<field name="metatag.rawcontent" type="text" stored="true" indexed="true" multiValued="false"/>
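Depending on your Nutch version, you may also need to map the document field onto that Solr field in conf/solrindex-mapping.xml. A hedged sketch, assuming the field produced on the indexing side is called "rawcontent" (only the line to add is shown; the rest of the default mapping stays as it is):

<!-- inside the <fields> element of conf/solrindex-mapping.xml -->
<field dest="metatag.rawcontent" source="rawcontent"/>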
Compile, deploy, re-crawl, re-index.
You should now see raw HTML content in your SOLR index.
Note: Make sure you have enabled the metatags plugins. This is important because you are essentially storing rawcontent as parse metadata. A config sketch follows below.
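For example, with the index-metadata plugin the relevant parts of conf/nutch-site.xml could look roughly like this. This is only a hedged sketch: "parse-rawcontent" is the hypothetical plugin id from the plugin.xml above, and the exact default value of plugin.includes varies between Nutch releases.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|rawcontent)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>index.parse.md</name>
  <!-- parse-metadata keys that index-metadata should copy into the indexed document -->
  <value>rawcontent</value>
</property>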
Upvotes: 2
Reputation: 11
You could use Nutch 2.1 with a Cassandra backend, MySQL (it has some bugs), or HBase. You will then be able to run queries against the database and obtain all the HTML code of the pages.
Upvotes: 0