Reputation: 1900
I've been using Nutch 1.10 to index data into Elasticsearch for a couple of years. A while ago, I decided to start upgrading to newer versions of both Nutch and ES.
After lots of Googling, it seems people are using Nutch 2.x more and more, even though Nutch 1.x appears to be faster and under more active development. It also seems that since Nutch 1.10 it has become more difficult to use Nutch 1.x with ES.
The big difference seems to be that Nutch 2.x can store the crawled data in different databases, whereas Nutch 1.x is really good at crawling, and crawling fast, but that's it.
So which version of Nutch is best to use with ES v2+ or ES v5.x?
Upvotes: 2
Views: 1754
Reputation: 631
Go for Nutch 1.x. I am using Nutch 1.14 with ES 5.6.0 via the indexer-elastic-rest plugin, and it is working seamlessly without any issues.
Upvotes: 2
Reputation: 3298
If you're running Nutch in production, it's probably better to stick with Nutch 1.x: it has more features and, as you said, performs better than 2.x. As for ES compatibility, I don't think there is a lot of difference.
Nutch 1.x actually ships with support for ES 5.3, which means that if you download the .zip file (or build directly from source) you'll get the ES client libraries for v5.3.
There is a bit of documentation explaining how to upgrade: https://github.com/apache/nutch/blob/master/src/plugin/indexer-elastic/howto_upgrade_es.txt. Of course, this upgrade path only works as long as the ES client library doesn't change its public API (which could happen); if it does, a PR would be more than welcome.
Nutch 2.x is a little behind (it still ships with support for ES 1.x and 2.x), but it has similar upgrade documentation, subject to the same warning as before.
Another option is to use the indexer-elastic-rest plugin, which doesn't rely on the ES client library but on the Jest library from Searchly (https://github.com/searchbox-io/Jest). This means that the documents are sent using the REST API instead of the binary protocol that the ES client library uses.
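To give a rough idea of what "sent using the REST API" means in practice, here is a minimal sketch of the newline-delimited JSON body that Elasticsearch's `_bulk` endpoint accepts over HTTP. This is not the plugin's actual code (the plugin is Java and goes through Jest); the function name `build_bulk_body`, the index name `nutch`, the type `doc`, and the use of the page URL as the document id are all assumptions made for illustration.

```python
import json


def build_bulk_body(docs, index="nutch", doc_type="doc"):
    """Build an NDJSON body for Elasticsearch's _bulk REST endpoint.

    Hypothetical helper: the index name, type name, and the choice of
    the "url" field as the document id are illustrative assumptions.
    """
    lines = []
    for doc in docs:
        # Each document is preceded by an action line describing the operation.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc["url"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    # The bulk format requires every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [{"url": "http://example.com/", "title": "Example"}]
    print(build_bulk_body(sample), end="")
```

A body like this would be POSTed to `/_bulk` on the HTTP port (9200 by default) rather than going over the transport port used by the native Java client, which is why the REST route is less sensitive to client library version mismatches.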
In any case, Nutch 1.x is more actively maintained and, as you can see, more up to date than the 2.x branch.
Upvotes: 4