Freedom

Reputation: 833

Is Solr necessary to index crawled data for Nutch?

I found that Nutch 1.4 ships with only one indexer, solrindex. Is Solr the only way for Nutch to index the crawled data? If not, what are the other ways?

I'm also wondering why Nutch 1.4 uses Solr to index the data. Why not do it itself? Doesn't this increase the coupling between the two projects?

Upvotes: 1

Views: 680

Answers (1)

Tejas Patil

Reputation: 6169

Solr uses Lucene internally, and Nutch was designated a subproject of Lucene in 2005. Historically (up to version 1.0), Nutch used Lucene indexes directly and was a full-fledged search engine: it could crawl, index the crawled data, and serve a browser UI for querying the index (similar to a Google search).

Since the initial design was built around Lucene (another Apache project that earned a lot of kudos at the time and is still going strong), the Nutch code was NOT made generic enough for other indexing frameworks to be plugged in. If you want a different indexing framework, it takes considerable effort to integrate it.

In recent versions (Nutch 1.3 and later), the Nutch dev team realized that it was difficult to keep up with the work involved in indexing, given the changing requirements and the expertise required. It was better to delegate the responsibility of indexing to Solr (a Lucene-based search server) and let the Nutch developers focus only on the crawling part. So Nutch is no longer a full-fledged search engine, but it is a full-fledged web crawler.
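In practice, this split shows up directly on the Nutch 1.4 command line: Nutch handles fetching and link analysis, and a single solrindex step pushes the crawled segments to a running Solr instance. A rough sketch of the workflow (the directory layout, depth, and Solr URL here are illustrative assumptions, not fixed values):

```shell
# Crawl a seed list to depth 2; "urls/" holds the seed URLs
# and "crawl/" receives the crawldb, linkdb, and segments.
bin/nutch crawl urls -dir crawl -depth 2 -topN 50

# Hand the crawled data to a running Solr instance for indexing --
# solrindex is the only indexer shipped with Nutch 1.4.
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb \
    -linkdb crawl/linkdb crawl/segments/*
```

Querying and ranking then happen entirely on the Solr side; Nutch's job ends once the segments are indexed.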

Hope this answers your query. You can browse the Nutch news page for more info.

Latest happenings:

Recently there have been efforts to create a generic library for crawlers (under Apache Commons). This project, commons-crawler, will contain all the functions required by a web crawler and can be used to build new crawlers. Future Nutch versions will use this library as a dependency.

Upvotes: 2
