user3125823

Reputation: 1900

Should I use Nutch 1.x or 2.x with Elasticsearch?

I've been using Nutch 1.10 to index data into Elasticsearch for a couple of years. A while ago, I decided to start upgrading to newer versions of both Nutch and ES.

After a lot of Googling, it seems that more and more people are using Nutch 2.x, even though Nutch 1.x appears to be faster and under more active development. It also seems that using Nutch 1.x with ES has become more difficult since Nutch 1.10.

The big difference seems to be that Nutch 2.x can store the crawled data in different databases, whereas Nutch 1.x is really good at crawling, and crawling fast, but that's it.

So which version of Nutch is best to use with ES v2+ or ES v5.x?

Upvotes: 2

Views: 1754

Answers (2)

Yash Thenuan

Reputation: 631

Go for Nutch 1.x. I am using Nutch 1.14 with ES 5.6.0 through the indexer-elastic-rest plugin, and it works seamlessly without any issues.

Upvotes: 2

Jorge Luis

Reputation: 3298

If you're running Nutch in production, it's probably better to stick with Nutch 1.x: it has more features and, as you said, performs better than 2.x. As for ES compatibility, I don't think there is much difference.

Nutch 1.x actually ships with support for ES 5.3: if you download the .zip file (or build directly from source), you get the ES client libraries for v5.3.

There is a bit of documentation explaining how to upgrade: https://github.com/apache/nutch/blob/master/src/plugin/indexer-elastic/howto_upgrade_es.txt. Of course, this upgrade "path" only works as long as the ES client library doesn't change its public API (which could happen), at which point a PR would be more than welcome.

Nutch 2.x is a little behind (it still ships with support for ES 1.x and 2.x), but it has similar upgrade documentation, subject to the same caveat as above.

Another option is to use the indexer-elastic-rest plugin, which doesn't rely on the ES client library but on the Jest library from Searchly (https://github.com/searchbox-io/Jest). This means the documents are sent through the REST API instead of the binary protocol that the ES client library uses.
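To make the REST route a bit more concrete, here is a rough Python sketch of what a bulk indexing request to the ES REST API looks like. It is not the code that Nutch or Jest actually runs, only an illustration of the request format. The index name `nutch`, the type `doc`, the sample fields, and the helper names `build_bulk_body` and `send_bulk` are all assumptions made up for the example, and the `_type` field applies to ES 5.x (later ES versions dropped types).

```python
import json
import urllib.request


def build_bulk_body(index, doc_type, docs):
    """Serialize (doc_id, fields) pairs into an NDJSON body for POST /_bulk.

    Hypothetical helper for illustration; not part of Nutch or Jest.
    """
    lines = []
    for doc_id, fields in docs:
        # Action line: tells ES where the document that follows should go.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(fields))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


def send_bulk(base_url, body):
    """POST the NDJSON body to <base_url>/_bulk (assumes an unauthenticated node)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example document; the field names are assumed, not taken from a real crawl.
body = build_bulk_body(
    "nutch", "doc", [("http://example.com/", {"title": "Example"})]
)
```

Calling `send_bulk("http://localhost:9200", body)` would then push the document over plain HTTP, assuming an ES node is listening on that address. Because only HTTP and JSON are involved, this route is less tied to the exact version of the ES client library.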

In any case, Nutch 1.x is more actively maintained and, as you can see, more up to date than the 2.x branch.

Upvotes: 4
