Reputation: 8670
I have to extract some metadata from data crawled by Apache Nutch 2.3.1 that Nutch does not provide by default. For that I have to write a plugin. For learning purposes, I took the Nutch plugin tutorial as a starting point. I know this tutorial is for the 1.x version; I changed all the required classes and built it successfully. These are the steps I followed.
myPlugin/
    plugin.xml
    build.xml
    ivy.xml
    src/java/org/apache/nutch/indexer/AddField.java
<?xml version="1.0" encoding="UTF-8"?>
<plugin id="myPlugin" name="Add Field to Index" version="1.0.0"
        provider-name="your name">
   <runtime>
      <library name="myPlugin.jar">
         <export name="*"/>
      </library>
   </runtime>
   <extension id="org.apache.nutch.indexer.myPlugin"
              name="Add Field to Index"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="myPlugin" class="org.apache.nutch.indexer.AddField"/>
   </extension>
</plugin>
<?xml version="1.0" encoding="UTF-8"?>
<project name="myPlugin" default="jar">
   <import file="../build-plugin.xml"/>
</project>
Register the plugin in src/plugin/build.xml, under the deploy target:

<ant dir="myPlugin" target="deploy" />
Edit ./conf/nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>plugin-1|plugin-2|myPlugin</value>
  <description>Added myPlugin</description>
</property>
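Note that plugin.includes is a regular expression matched against plugin ids, and plugin-1|plugin-2 above are placeholders: if you replace the whole value with only your plugin, the standard protocol, parse, and indexing plugins are switched off. A value along these lines keeps a typical 2.x default list and appends the new plugin (the exact list is an assumption; keep whatever your install already uses and append |myPlugin):

```xml
<property>
  <name>plugin.includes</name>
  <!-- assumed default-style 2.x plugin list; adjust to your installation -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|myPlugin</value>
  <description>Default plugins plus myPlugin</description>
</property>
```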
Add the following lines to schema.xml and solrindex-mapping.xml respectively:
<field name="pageLength" type="long" stored="true" indexed="true"/>

<field dest="pageLength" source="pageLength"/>
Then I compiled my code (similar to the example given in the tutorial URL).
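For reference, a 2.x-style indexing filter roughly along the lines of the tutorial might look like the sketch below. This is an assumption about what AddField contains (the question does not show it); it computes pageLength from the stored page content, and it only compiles with the Nutch 2.3.1 jars on the classpath:

```java
package org.apache.nutch.indexer;

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collection;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.WebPage;

public class AddField implements IndexingFilter {

  private Configuration conf;

  // Declare which WebPage fields this filter needs read from the datastore.
  private static final Collection<WebPage.Field> FIELDS =
      new ArrayList<WebPage.Field>();
  static {
    FIELDS.add(WebPage.Field.CONTENT);
  }

  @Override
  public NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException {
    // Use the raw content length as the pageLength field value.
    ByteBuffer content = page.getContent();
    int length = (content == null) ? 0 : content.remaining();
    doc.add("pageLength", Integer.toString(length));
    return doc;
  }

  @Override
  public Collection<WebPage.Field> getFields() {
    return FIELDS;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}
```

If getFields() returns an empty collection, page.getContent() may be null at indexing time, since Nutch only loads the declared fields from the backing store.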
When I run Nutch in local mode, the following is the log output of the indexing-to-Solr step:
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
IndexingJob: done.
I have also added the field pageLength to the Solr schema. I expected a new field pageLength with proper values, but there is no such field in Solr.
Where is the problem? It's a simple toy example. This is the Nutch log file (hadoop.log) output for the indexing step:
2016-07-26 16:53:25,649 INFO solr.SolrMappingReader - source: content dest: content
2016-07-26 16:53:25,649 INFO solr.SolrMappingReader - source: title dest: title
2016-07-26 16:53:25,649 INFO solr.SolrMappingReader - source: host dest: host
2016-07-26 16:53:25,649 INFO solr.SolrMappingReader - source: batchId dest: batchId
2016-07-26 16:53:25,649 INFO solr.SolrMappingReader - source: boost dest: boost
2016-07-26 16:53:25,649 INFO solr.SolrMappingReader - source: digest dest: digest
2016-07-26 16:53:25,649 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2016-07-26 16:53:25,649 INFO solr.SolrMappingReader - source: pageLength dest: pageLength
2016-07-26 16:53:26,140 INFO solr.SolrIndexWriter - Total 1 document is added.
2016-07-26 16:53:26,140 INFO indexer.IndexingJob - IndexingJob: done.
How can I confirm that the plugin is loaded by Nutch? Second, is there any way to test a Nutch plugin before configuring it in Nutch for crawling?
Upvotes: 1
Views: 279
Reputation: 576
Try changing the extension id in plugin.xml to "org.apache.nutch.indexer.AddField" and rebuild Nutch:
<extension id="org.apache.nutch.indexer.AddField"
name="Add Field to Index"
point="org.apache.nutch.indexer.IndexingFilter">
<implementation id="myPlugin"
class="org.apache.nutch.indexer.AddField"/>
</extension>
I think that should solve the issue.
Also, just to verify whether control reaches your plugin class, add an info log in your code, like:
LOG.info("printing from plugin");
If you can see this line in hadoop.log, that means control is reaching your plugin class.
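To confirm that the plugin is registered at all, you can also raise the log level of the plugin framework; Nutch's PluginRepository then writes its status (registered plugins and extension points) into hadoop.log, where you can grep for myPlugin. The logger name below is an assumption based on the stock conf/log4j.properties; adjust it to your logging setup:

```properties
# conf/log4j.properties -- assumed addition, verify the logger name
log4j.logger.org.apache.nutch.plugin.PluginRepository=DEBUG
```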
Upvotes: 1