Reputation: 17
I have Apache Nutch 1.7 and Elasticsearch 1.4.4 on an AWS EC2 Ubuntu instance. I crawled data using Nutch, but how can I index that data in Elasticsearch? I could not find any official documentation on this.
Upvotes: 0
Views: 1686
Reputation: 270
In your nutch-site.xml add the following properties:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
The above makes Elasticsearch the indexer. Next, specify the Elasticsearch host:
<property>
<name>elastic.host</name>
<value>localhost</value>
</property>
The other optional properties you can set are elastic.port, elastic.cluster, etc.
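As a rough sketch of those optional settings, something like the following could sit next to elastic.host in nutch-site.xml. The values are assumptions, not something taken from your setup: 9300 is the usual Elasticsearch transport port, "elasticsearch" is the default cluster name in the 1.x line, and elastic.index (not named above, included only as an illustration) is assumed to select the target index. Check them against your own nutch-default.xml and cluster before relying on them.
<property>
<name>elastic.port</name>
<!-- assumption: transport port, not the 9200 HTTP port -->
<value>9300</value>
</property>
<property>
<name>elastic.cluster</name>
<!-- assumption: must match cluster.name in your elasticsearch.yml -->
<value>elasticsearch</value>
</property>
<property>
<name>elastic.index</name>
<!-- assumption: name of the index the documents are written to -->
<value>nutch</value>
</property>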
Since you have already crawled the data and only need to index it, you can run:
./bin/nutch index <crawldb> -dir <segment_dir>
This indexes all the crawled data in the segments directory. Then you can check your Elasticsearch index for the documents.
Upvotes: 1
Reputation: 782
Enable the Elasticsearch indexer in the configuration by adding indexer-elastic to the plugin.includes property. See below:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Upvotes: 1