How to index crawled data from Apache Nutch using Elasticsearch?

Question

I have Apache Nutch 1.7 and Elasticsearch 1.4.4 on an AWS EC2 Ubuntu instance. I crawled data using Nutch, but how can I index that data with Elasticsearch? I could not find any official documentation on this.

Sujen Shah · Accepted Answer

In your nutch-site.xml add the following properties:


        <property>
          <name>plugin.includes</name>
          <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        </property>

The above makes Elasticsearch the indexer (note the indexer-elastic entry). The following property specifies the Elasticsearch host:


        <property>
          <name>elastic.host</name>
          <value>localhost</value>
        </property>

Other optional properties you can set include elastic.port, elastic.cluster, etc.

You said you have already crawled the data and now want to index it, so you can run the index command:

./bin/nutch index <crawldb> -dir <segments_dir>

This indexes all the crawled data residing in the segments. Then you can check your Elasticsearch index for the documents.
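
As a rough sketch of the index-then-check steps, something like the following could work. Everything concrete here is an assumption rather than part of the answer: the crawl directory name `crawl`, the index name `nutch` (which I believe is the default for elastic.index), and the Elasticsearch HTTP port 9200 (the Nutch plugin itself talks to the transport port, which is a different setting).

```shell
# Assumptions (not from the answer): HTTP port 9200, index name "nutch",
# and a crawl directory named "crawl". Adjust these to your setup.
ES_HOST="${ES_HOST:-localhost}"
ES_HTTP_PORT="${ES_HTTP_PORT:-9200}"
ES_INDEX="${ES_INDEX:-nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

# Run the indexing step only when started from a Nutch runtime directory
# that actually contains crawl segments.
if [ -x ./bin/nutch ] && [ -d "$CRAWL_DIR/segments" ]; then
  ./bin/nutch index "$CRAWL_DIR/crawldb" -dir "$CRAWL_DIR/segments"
fi

COUNT_URL="http://${ES_HOST}:${ES_HTTP_PORT}/${ES_INDEX}/_count?pretty"
SEARCH_URL="http://${ES_HOST}:${ES_HTTP_PORT}/${ES_INDEX}/_search?q=*:*&size=1&pretty"

# Count the indexed documents, then fetch one document to inspect its fields.
curl -s --max-time 5 "$COUNT_URL" || echo "Elasticsearch not reachable at $COUNT_URL"
curl -s --max-time 5 "$SEARCH_URL" || echo "Elasticsearch not reachable at $SEARCH_URL"
```

If the count stays at zero after indexing, the cluster name and host in nutch-site.xml are the first things I would compare against the Elasticsearch configuration.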
