Amir

Reputation: 351

How to parse HTML with Nutch and index a specific tag to Solr?

I have installed Nutch and Solr to crawl a website and search it. As you know, the meta tags of web pages can be indexed into Solr with Nutch's parse meta tags plugin (http://wiki.apache.org/nutch/IndexMetatags). Is there any way (a plugin or anything else) to index another HTML tag, one that isn't a meta tag, into Solr? For example:

<div id=something>
      me specific tag
</div>

That is, I want to add a field (something) to Solr whose value for this page is "me specific tag".

Any ideas?

Upvotes: 5

Views: 9803

Answers (4)

tahagh

Reputation: 807

You can use one of these custom plugins to parse XML files based on XPath (or CSS selectors):

Upvotes: 1

Arul Pandian

Reputation: 1693

Try http://lifelongprogrammer.blogspot.in/2013/08/nutch2-crawl-and-index-extra-tag.html. The tutorial shows how to extract the img tag and lists all the steps involved.

Upvotes: 2

Babu

Reputation: 5220

I made my own plugin for something similar to what you want. The config file for mapping a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml, and you can add your own tags there. You still have to fill those tags with values somewhere, though.

Here are some tips for writing the plugin:

  • read http://wiki.apache.org/nutch/WritingPluginExample, which shows how to build a simple plugin
  • in your plugin, implement ParseFilter and IndexingFilter
  • in YourParseFilter, you can use NodeWalker to find your specific div
  • put the parsed information into the page metadata like this:

    page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));

  • in YourIndexingFilter, add the metadata from the page (page.getMetadata) to the NutchDocument:

    doc.add("your_specific_tag", value);

  • most important: add your_specific_tag to the fields of:

    • Solr config file schema.xml (and restart Solr)

    <field name="your_specific_tag" type="string" stored="true" indexed="true"/>

    • Nutch config file schema.xml (I don't know if this is really necessary)
    • Nutch config file solrindex-mapping.xml

    <field dest="your_specific_tag" source="your_specific_tag"/>
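
To make the parse step above more concrete, here is a rough sketch of the DOM walk and of the metadata round trip, using only JDK classes so it runs without Nutch on the classpath. The class and method names (DivTextExtractor, findDivText, toMetadataValue, fromMetadataValue) are made up for illustration and are not part of Nutch. It assumes your parse filter is handed an org.w3c.dom node for the already parsed page; inside a real plugin you would walk that node (for example with NodeWalker) instead of calling the JDK XML parser, which only accepts well-formed markup, so the id attribute is quoted here.

```java
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class: shows the logic a parse filter could use.
final class DivTextExtractor {

    // Depth-first search for the first <div> whose id attribute equals the given id.
    // Returns the trimmed text of that div, or null if no such div exists.
    static String findDivText(Node node, String id) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getTagName()) && id.equals(el.getAttribute("id"))) {
                return el.getTextContent().trim();
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            String found = findDivText(child, id);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Encodes the text as the ByteBuffer value that the page metadata expects.
    static ByteBuffer toMetadataValue(String text) {
        return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    }

    // Decodes a metadata value back to a String (for use in the indexing filter).
    // duplicate() leaves the position of the original buffer untouched.
    static String fromMetadataValue(ByteBuffer value) {
        return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    // Stand-in for the DOM that Nutch would pass in; requires well-formed markup.
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        Node doc = parse("<html><body><div id=\"something\">\n  me specific tag\n</div></body></html>");
        String text = findDivText(doc, "something");
        ByteBuffer stored = toMetadataValue(text);      // what the parse filter would store
        System.out.println(fromMetadataValue(stored));  // what the indexing filter would read back
    }
}
```

In the plugin itself, the ByteBuffer from toMetadataValue would be the value passed to page.putToMetadata, and the decoded String would be the value passed to doc.add("your_specific_tag", value).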

Upvotes: 3

Jayendra

Reputation: 52769

You may want to check Nutch Plugin which should allow you to extract an element from a web page.

Upvotes: 0
