user2293660

Reputation: 11

How to index apache nutch fetched content without parsing into solr

I need to index the fetched content crawled by Nutch into Solr. The Solr indexing job in Nutch indexes only the parsed content, but I need the content with all its HTML tags. Can anyone guide me on this?

Thanks Sudh

Upvotes: 1

Views: 1658

Answers (2)

nimeshjm

Reputation: 1708

Nutch has a series of parsers and filters that will extract content from the fetched HTML.

You need to implement an HtmlParseFilter, write the raw content into a parse metadata tag, and map that tag to a Solr field.

The tutorial below is about an indexing filter but it follows the same flow.

Nutch plugin

Your class should implement "HtmlParseFilter" instead of "IndexingFilter". Override the filter() method:

@Override
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) {
    Metadata metadata = parseResult.get(content.getUrl()).getData().getParseMeta();
    byte[] rawContent = content.getContent();
    // java.nio.charset.StandardCharsets.UTF_8 avoids the checked
    // UnsupportedEncodingException thrown by new String(bytes, "UTF-8")
    String str = new String(rawContent, StandardCharsets.UTF_8);
    metadata.add("rawcontent", str);
    return parseResult;
}
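For completeness, a Nutch plugin also has to register the filter against the HtmlParseFilter extension point in its plugin.xml. A minimal sketch follows; the plugin id, jar name, and class names are placeholders you would replace with your own:

```xml
<plugin id="parse-rawcontent" name="Raw content parse filter"
        version="1.0.0" provider-name="example">
   <runtime>
      <!-- jar produced by the plugin's build -->
      <library name="parse-rawcontent.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <!-- hook the class into the HtmlParseFilter extension point -->
   <extension id="org.example.parse.rawcontent"
              name="Raw content filter"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="RawContentFilter"
                      class="org.example.parse.RawContentFilter"/>
   </extension>
</plugin>
```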

After that, change your schema.xml and add the new field:

<field name="metatag.rawcontent" type="text" stored="true" indexed="true" multiValued="false"/>

Compile, deploy, re-crawl, re-index.

You should now see raw HTML content in your SOLR index.

Note:

Make sure you have enabled the metatags plugins. This is important because you are essentially storing the raw content as parse metadata.
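As a sketch of what that means in nutch-site.xml, assuming a custom plugin named parse-rawcontent (a placeholder) and the index-metadata plugin copying the metadata key into the index; exact property names can vary between Nutch versions, so check them against your release:

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|metatags)|parse-rawcontent|index-(basic|anchor|metadata)|indexer-solr</value>
</property>
<property>
  <!-- index-metadata: parse metadata keys to copy into the index -->
  <name>index.parse.md</name>
  <value>rawcontent</value>
</property>
```

The field name in schema.xml must match whatever key the indexing side emits for your metadata tag.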

Upvotes: 2

vetus

Reputation: 11

You could use Nutch 2.1 with a Cassandra backend, or MySQL (it has some bugs), or HBase. Then you will be able to query the database directly and obtain the full HTML of the pages.
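As a sketch with the HBase backend, assuming the default Gora mapping in which the webpage table stores the fetched bytes in column family `f`, qualifier `cnt` (verify against your gora-hbase-mapping.xml, and note the table may be prefixed with your crawl id):

```
# In the HBase shell: fetch the raw content of a single page.
# Row keys are URLs with the host reversed, e.g. com.example.www:http/
get 'webpage', 'com.example.www:http/', {COLUMN => 'f:cnt'}
```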

Upvotes: 0
