Reputation: 351
I have installed Nutch and Solr to crawl a website and search it. As you know, Nutch's parse-metatags plugin can index the meta tags of web pages into Solr (http://wiki.apache.org/nutch/IndexMetatags). Is there any way, via a plugin or otherwise, to index an HTML tag other than meta into Solr? For example:
<div id=something>
me specific tag
</div>
That is, I want to add a field (something) to Solr whose value is "me specific tag" for this page.
Any ideas?
Upvotes: 5
Views: 9803
Reputation: 807
You can use one of these custom plugins to parse XML files based on XPath (or CSS selectors):
Upvotes: 1
Reputation: 1693
Try http://lifelongprogrammer.blogspot.in/2013/08/nutch2-crawl-and-index-extra-tag.html. The tutorial shows how to extract the img tag and describes the steps involved.
Upvotes: 2
Reputation: 5220
I made my own plugin for something similar to what you want. The config file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml; you can add your own fields there. You still have to populate those fields somewhere, though.
Here are some tips for the plugin:
Put your parsed information into the page metadata, like this:
page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));
In your indexing filter, add the metadata from the page (page.getMetadata) to the NutchDocument:
doc.add("your_specific_tag", value);
Most important: add your_specific_tag to the field definitions (a Solr schema field and a solrindex-mapping.xml entry):
<field name="your_specific_tag" type="string" stored="true" indexed="true"/>
<field dest="your_specific_tag" source="your_specific_tag"/>
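To make the two steps above concrete, here is a minimal sketch of the round trip: the parse step stores the extracted text as bytes under a key, and the indexing step reads it back and adds it to the document. It does not use the Nutch API. The plain maps are stand-ins I am assuming for the page metadata and the NutchDocument, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the parse-filter / indexing-filter pair.
class MetadataRoundTrip {

    // Parse step: store the extracted text under a key as UTF-8 bytes,
    // mirroring page.putToMetadata(key, ByteBuffer.wrap(bytes)).
    static void putParsedValue(Map<String, ByteBuffer> pageMetadata,
                               String key, String value) {
        pageMetadata.put(key, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    // Index step: read the bytes back and add them to the document as a
    // field value, mirroring doc.add("your_specific_tag", value).
    static void copyToDocument(Map<String, ByteBuffer> pageMetadata, String key,
                               Map<String, List<String>> doc, String fieldName) {
        ByteBuffer raw = pageMetadata.get(key);
        if (raw == null) {
            return; // nothing was parsed for this page
        }
        // duplicate() leaves the stored buffer's position untouched,
        // so the value can be read again later.
        String value = StandardCharsets.UTF_8.decode(raw.duplicate()).toString();
        doc.computeIfAbsent(fieldName, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        Map<String, ByteBuffer> pageMetadata = new HashMap<>();
        Map<String, List<String>> doc = new HashMap<>();

        putParsedValue(pageMetadata, "yourKEY", "me specific tag");
        copyToDocument(pageMetadata, "yourKEY", doc, "your_specific_tag");

        System.out.println(doc);
    }
}
```

The field name passed to the indexing step has to match the one declared in the field definitions, otherwise the value never reaches Solr.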
Upvotes: 3
Reputation: 52769
You may want to check out Nutch plugins, which should allow you to extract an element from a web page.
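As a rough illustration of what "extract an element" means for the div in the question, here is a sketch that uses only the JDK's XML and XPath classes, not the Nutch plugin API. It assumes the page is well-formed XHTML (real pages usually need a tolerant HTML parser first), and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: returns the trimmed text of the element with a given id.
class DivTextExtractor {

    static String textOfElementWithId(String xhtml, String id) {
        // The id is spliced into the XPath expression, so reject quotes.
        if (id.contains("'")) {
            throw new IllegalArgumentException("id must not contain a single quote");
        }
        try {
            // Assumption: the input is well-formed XML.
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Evaluates to the string value of the first match,
            // or to an empty string when nothing matches.
            return xpath.evaluate("//*[@id='" + id + "']", dom).trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query the page", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"something\">\nme specific tag\n</div></body></html>";
        System.out.println(textOfElementWithId(page, "something"));
    }
}
```

Inside a real plugin the value returned here is what you would store in the page metadata and later map to a Solr field.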
Upvotes: 0