Hafiz Muhammad Shafiq

Reputation: 8670

Apache Nutch 2.3.1 plugin not working

I have to extract some metadata from data crawled by Apache Nutch 2.3.1 that Nutch does not provide by default. For that I have to write a plugin. For learning purposes, I have taken the Nutch tutorial as a starting point. I know this tutorial is for the 1.x version. I have changed all the required classes and built it successfully. These are the steps I followed.

  1. Create a directory like $NUTCH_HOME/src/plugin/myPlugin
  2. Copy index-metadata into my plugin and create the file AddField.java: cp -r index-metadata/* myPlugin/
  3. The directory listing should look like this:
myPlugin/plugin.xml
myPlugin/build.xml
myPlugin/ivy.xml
myPlugin/src/java/org/apache/nutch/indexer/AddField.java
  4. myPlugin/plugin.xml should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<plugin id="myPlugin" name="Add Field to Index"
    version="1.0.0" provider-name="your name">
   <runtime>
     <library name="myPlugin.jar">
       <export name="*"/>
     </library>
   </runtime> 
   <extension id="org.apache.nutch.indexer.myPlugin"
       name="Add Field to Index"
       point="org.apache.nutch.indexer.IndexingFilter">
     <implementation id="myPlugin"
         class="org.apache.nutch.indexer.AddField"/>
   </extension>
</plugin>
  5. Change build.xml to:
<?xml version="1.0" encoding="UTF-8"?>
<project name="myPlugin" default="jar">
  <import file="../build-plugin.xml"/>
</project>
  6. Then add the plugin to the deploy target in src/plugin/build.xml:

<ant dir="myPlugin" target="deploy" />

  7. Edit ./conf/nutch-site.xml:

    <property>
      <name>plugin.includes</name>
      <value>plugin-1|plugin-2|myPlugin</value>
      <description>Added myPlugin</description>
    </property>
    
  8. Add the following lines to schema.xml and solrindex-mapping.xml, respectively:

    <field name="pageLength" type="long" stored="true" indexed="true"/>
    <field dest="pageLength" source="pageLength"/>
    
  9. Then I compiled my code (similar to the example given in the tutorial).

When I run Nutch in local mode, this is the log output of the Solr indexing step:

Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication    
IndexingJob: done.

I have also added the field pageLength to the Solr schema. I expected a new field pageLength with proper values, but there is no such field in Solr.

Where is the problem? It's a simple toy example. This is the Nutch log file (hadoop.log) output for the indexing step:

2016-07-26 16:53:25,649 INFO  solr.SolrMappingReader - source: content dest: content
2016-07-26 16:53:25,649 INFO  solr.SolrMappingReader - source: title dest: title
2016-07-26 16:53:25,649 INFO  solr.SolrMappingReader - source: host dest: host
2016-07-26 16:53:25,649 INFO  solr.SolrMappingReader - source: batchId dest: batchId
2016-07-26 16:53:25,649 INFO  solr.SolrMappingReader - source: boost dest: boost
2016-07-26 16:53:25,649 INFO  solr.SolrMappingReader - source: digest dest: digest
2016-07-26 16:53:25,649 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2016-07-26 16:53:25,649 INFO  solr.SolrMappingReader - source: pageLength dest: pageLength
2016-07-26 16:53:26,140 INFO  solr.SolrIndexWriter - Total 1 document is added.
2016-07-26 16:53:26,140 INFO  indexer.IndexingJob - IndexingJob: done.

How can I confirm that the plugin is loaded by Nutch? Second, is there any way to test a Nutch plugin before I configure it in Nutch for crawling?

Upvotes: 1

Views: 279

Answers (1)

Anup

Reputation: 576

Try changing the extension id in plugin.xml. Change it to "org.apache.nutch.indexer.AddField" and rebuild Nutch:

<extension id="org.apache.nutch.indexer.AddField"
       name="Add Field to Index"
       point="org.apache.nutch.indexer.IndexingFilter">
     <implementation id="myPlugin"
         class="org.apache.nutch.indexer.AddField"/>
</extension>
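
Before rebuilding, it can help to double-check what the descriptor actually declares. Below is a minimal sketch, using only the JDK's XML parser, that lists each extension id, extension point and implementation class found in a plugin.xml string. The class name PluginDescriptorCheck and the method describe are made up for illustration; this does not use or imitate Nutch's own plugin loader.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: prints what a plugin.xml declares.
public class PluginDescriptorCheck {

    /** Returns one "extensionId -> point -> class" entry per implementation element. */
    public static List<String> describe(String pluginXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pluginXml.getBytes(StandardCharsets.UTF_8)));
            List<String> entries = new ArrayList<>();
            NodeList extensions = doc.getElementsByTagName("extension");
            for (int i = 0; i < extensions.getLength(); i++) {
                Element ext = (Element) extensions.item(i);
                NodeList impls = ext.getElementsByTagName("implementation");
                for (int j = 0; j < impls.getLength(); j++) {
                    Element impl = (Element) impls.item(j);
                    entries.add(ext.getAttribute("id") + " -> "
                            + ext.getAttribute("point") + " -> "
                            + impl.getAttribute("class"));
                }
            }
            return entries;
        } catch (Exception e) {
            // Malformed XML or parser setup problems end up here.
            throw new IllegalArgumentException("Could not parse plugin.xml", e);
        }
    }

    public static void main(String[] args) {
        // Descriptor from this thread, shortened to the parts being checked.
        String xml =
              "<plugin id=\"myPlugin\" name=\"Add Field to Index\">"
            + "<extension id=\"org.apache.nutch.indexer.AddField\""
            + " point=\"org.apache.nutch.indexer.IndexingFilter\">"
            + "<implementation id=\"myPlugin\""
            + " class=\"org.apache.nutch.indexer.AddField\"/>"
            + "</extension></plugin>";
        describe(xml).forEach(System.out::println);
    }
}
```

The class attribute should match the fully qualified name of the compiled class inside myPlugin.jar, so comparing the printed value against the package and class name in AddField.java is a cheap sanity check.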

I think that should solve the issue.
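
If it does not, it may also be worth ruling out the plugin.includes value. As far as I know, Nutch treats it as a regular expression and only keeps plugins whose id matches it in full, so a typo or a difference in case in the id silently drops the plugin. The sketch below imitates that check with plain java.util.regex. The class and method names are made up for illustration, the full-match behaviour is my assumption about Nutch, and the code does not call Nutch itself.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: imitates a full-match regex filter on plugin ids.
public class PluginIncludesCheck {

    /** True if the whole plugin id matches the plugin.includes expression. */
    public static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Value taken from the nutch-site.xml in the question (plugin-1 and plugin-2 are placeholders there).
        String includes = "plugin-1|plugin-2|myPlugin";
        for (String id : new String[] {"myPlugin", "myplugin", "myPlugin2"}) {
            System.out.println(id + " included: " + isIncluded(includes, id));
        }
    }
}
```

Only the exact id "myPlugin" passes here; "myplugin" fails on case and "myPlugin2" fails because the match must cover the whole id.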

Also, to verify whether control reaches your plugin class, add an info log statement to your code, for example:

LOG.info("printing from plugin");
If you can see this message in hadoop.log, control is reaching the plugin class.
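
Checking the log by hand works fine, but if you rerun the crawl often, a small scan can do it for you. This is only a sketch: the class name LogMarkerCheck is made up, and the default path logs/hadoop.log is an assumption about where your local runtime writes its log, so pass the real path as the first argument if it differs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Nutch: looks for the plugin's log marker in hadoop.log.
public class LogMarkerCheck {

    /** Returns the log lines that contain the given marker text. */
    public static List<String> linesContaining(List<String> logLines, String marker) {
        return logLines.stream()
                .filter(line -> line.contains(marker))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; override with the first command-line argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "logs/hadoop.log");
        List<String> hits = linesContaining(Files.readAllLines(log), "printing from plugin");
        if (hits.isEmpty()) {
            System.out.println("Marker not found: the plugin class was probably not called.");
        } else {
            hits.forEach(System.out::println);
        }
    }
}
```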

Upvotes: 1
