Reputation: 20450
While integrating Nutch 1.4 with Solr, I noticed that there are two groups of output.
I think the workflow on my site looks something like this:
1. Nutch 1.4 crawls the websites and generates three folders: "crawler/crawldb", "crawler/linkdb", and "crawler/segments".
2. Solr indexes the folder "crawler/" and generates its own folders, "data/index" and "data/spellchecker".
In total, there are five folders.
What I want to know is:
1. What exactly do these five folders contain?
2. Where does "PageRank" (or "LinkRank") come into play?
3. Does Nutch index the pages, and does Solr then index them again?
Many thanks.
Upvotes: 0
Views: 294
Reputation: 6169
Here are the details from the Nutch wiki page:
The crawl database, or crawldb. This contains information about every URL known to Nutch, including whether it was fetched, and, if so, when.
The link database, or linkdb. This contains the list of known links to each URL, including both the source URL and anchor text of the link.
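A set of segments. Each segment is a set of URLs that are fetched as a unit. Segments are directories with the following subdirectories:
crawl_generate names a set of URLs to be fetched.
crawl_fetch contains the status of fetching each URL.
content contains the raw content retrieved from each URL.
parse_text contains the parsed text of each URL.
parse_data contains outlinks and metadata parsed from each URL.
crawl_parse contains the outlink URLs, used to update the crawldb.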
The index folder contains the indexes created from the crawled content and the linkdb.
The spellchecker folder contains the spell-check index, which is generated to improve queries. This and this are worth reading if you want to know more about it. Also see this.
Read this and this. I am not sure whether this and this will be helpful, but they will add to your knowledge.
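To see which of these folders a given installation actually has, something like the following might help. It is only a minimal sketch: the "crawler" and "data" roots are the paths from the question, the segment subdirectory names are the ones listed in the Nutch tutorial, and describe_layout is a made-up helper name, not part of Nutch or Solr.

```python
from pathlib import Path

# Folder names taken from the question; adjust them to your own setup.
NUTCH_DIRS = ("crawldb", "linkdb", "segments")
SOLR_DIRS = ("index", "spellchecker")
# Segment subdirectory names as listed in the Nutch tutorial.
SEGMENT_PARTS = ("crawl_generate", "crawl_fetch", "content",
                 "parse_text", "parse_data", "crawl_parse")


def describe_layout(crawl_root, solr_data_root):
    """Report which of the expected directories exist on disk (hypothetical helper)."""
    crawl_root, solr_data_root = Path(crawl_root), Path(solr_data_root)
    report = {
        "nutch": {name: (crawl_root / name).is_dir() for name in NUTCH_DIRS},
        "solr": {name: (solr_data_root / name).is_dir() for name in SOLR_DIRS},
        "segments": {},
    }
    segments_dir = crawl_root / "segments"
    if segments_dir.is_dir():
        # Each segment is a directory; list the known parts it contains.
        for segment in sorted(p for p in segments_dir.iterdir() if p.is_dir()):
            report["segments"][segment.name] = [
                part for part in SEGMENT_PARTS if (segment / part).is_dir()
            ]
    return report


if __name__ == "__main__":
    import pprint
    pprint.pprint(describe_layout("crawler", "data"))
```

The script only checks that the directories exist; it does not read the Hadoop or Lucene files inside them.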
Indexes for the crawled data are generated by Apache Solr, not by Nutch.
This is how it works internally: Nutch passes all the data collected during parsing to the IndexingFilter extension, which generates the data to be indexed. The output of the filter is a NutchDocument, which is handed back to Nutch. Nutch then reads the mapping file, which defines which NutchDocument fields are mapped to which SolrDocument fields, and uses it to decide what data should be indexed.
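The mapping step can be pictured roughly as below. This is an illustrative Python sketch, not Nutch's Java code: the field names, the FIELD_MAPPING dictionary, and to_solr_document are all made up for the example, and dropping unmapped fields is an assumption that follows the description above (the real behaviour may differ). In Nutch 1.4 the mapping itself lives in conf/solrindex-mapping.xml.

```python
# Hypothetical mapping in the spirit of solrindex-mapping.xml:
# NutchDocument field name -> Solr field name.
FIELD_MAPPING = {
    "url": "id",
    "title": "title",
    "content": "content",
    "anchor": "anchor",
}


def to_solr_document(nutch_doc, mapping=FIELD_MAPPING):
    """Keep only mapped fields and rename them to their Solr field names.

    Assumption of this sketch: fields without a mapping entry are dropped.
    """
    return {
        mapping[name]: value
        for name, value in nutch_doc.items()
        if name in mapping
    }


if __name__ == "__main__":
    # Stand-in for the NutchDocument produced by the indexing filters.
    nutch_doc = {
        "url": "http://example.com/",
        "title": "Example",
        "content": "Some parsed text",
        "internal_note": "not mapped in this sketch",
    }
    print(to_solr_document(nutch_doc))
```

The resulting dictionary stands in for the SolrDocument that is sent to Solr, which then builds the actual index under "data/index".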
Upvotes: 2