Ankit Agrawal

Reputation: 17

How to index data crawled by Apache Nutch using Elasticsearch?

I have Apache Nutch 1.7 and Elasticsearch 1.4.4 running on an AWS EC2 Ubuntu instance. I have crawled data with Nutch, but how can I index that data into Elasticsearch? I could not find any official documentation on this.

Upvotes: 0

Views: 1686

Answers (2)

Sujen Shah

Reputation: 270

In your nutch-site.xml add the following properties:

<property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

This makes Elasticsearch the indexer. Next, specify the Elasticsearch host:

<property>
        <name>elastic.host</name>
        <value>localhost</value>
</property>

The other optional properties you can set are elastic.port, elastic.cluster, etc.
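
For illustration, the optional properties might look like the fragment below. This is only a sketch: the values are assumptions rather than settings from the question (9300 is the usual Elasticsearch transport port, "elasticsearch" is the stock cluster name, and "nutch" is assumed as the index name via elastic.index), so check them against your own nutch-default.xml and cluster before copying.

```xml
<!-- Assumed values for illustration only; adjust to your cluster. -->
<property>
        <name>elastic.port</name>
        <!-- Transport port, not the HTTP port (assumption: default 9300) -->
        <value>9300</value>
</property>
<property>
        <name>elastic.cluster</name>
        <!-- Assumption: the cluster still uses the stock name -->
        <value>elasticsearch</value>
</property>
<property>
        <name>elastic.index</name>
        <!-- Assumption: documents go to an index called "nutch" -->
        <value>nutch</value>
</property>
```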

Since you have already crawled the data and only need to index it, you can run:

./bin/nutch index <crawldb> -dir <segment_dir>

This indexes all the crawled data residing in the segments. Then you can check your Elasticsearch index for the documents.
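
One way to check the index is through Elasticsearch's HTTP API with curl. The sketch below assumes the HTTP API listens on port 9200 of the same host and that the index is named "nutch"; neither value comes from the question, and the helper names count_docs and sample_docs are made up for this example.

```shell
#!/bin/sh
# Assumptions (not from the answer): HTTP API on port 9200, index named "nutch".
# Override them through the environment if your setup differs.
ES_HOST="${ES_HOST:-localhost}"
ES_HTTP_PORT="${ES_HTTP_PORT:-9200}"
ES_INDEX="${ES_INDEX:-nutch}"
ES_URL="http://${ES_HOST}:${ES_HTTP_PORT}/${ES_INDEX}"

# Hypothetical helper: print the number of documents in the index.
count_docs() {
    curl -s -XGET "${ES_URL}/_count?pretty"
}

# Hypothetical helper: print a few indexed documents to inspect their fields.
sample_docs() {
    curl -s -XGET "${ES_URL}/_search?size=3&pretty"
}

# Usage, once the index command has finished:
#   count_docs
#   sample_docs
```

A document count above zero suggests the index step reached Elasticsearch; if the count stays at zero, the host, cluster, and index settings in nutch-site.xml are the first things to recheck.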

Upvotes: 1

aalbahem

Reputation: 782

Enable the Elasticsearch indexer in the configuration by adding indexer-elastic to the plugin.includes property list, as shown below:

    <property>
            <name>plugin.includes</name>
            <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

Upvotes: 1
