What exactly are the outputs when integrating Nutch 1.4 and Solr?

Question

When I integrate Nutch 1.4 and Solr, I notice that there are two groups of outputs.

I think the workflow on my site looks like this:

1. Nutch 1.4 crawls the websites and generates three folders: "crawler/crawldb", "crawler/linkdb", and "crawler/segments".

2. Solr indexes the folder "crawler/" and generates its own folders: "data/index" and "data/spellchecker".

In total, there are five folders.

What I want to know is:

1. What exactly do these five folders contain?

2. Where does "PageRank" (or "LinkRank") come into play?

3. Does Nutch index the pages, and does Solr then index them again?

Many thanks.

Tejas Patil · Accepted Answer

For question #1: What exactly do these five folders contain?

Here are the details from the Nutch wiki page:

The crawl database, or crawldb. This contains information about every URL known to Nutch, including whether it was fetched, and, if so, when.

The link database, or linkdb. This contains the list of known links to each URL, including both the source URL and anchor text of the link.

A set of segments. Each segment is a set of URLs that are fetched as a unit. Segments are directories with the following subdirectories:

a crawl_generate names a set of URLs to be fetched
a crawl_fetch contains the status of fetching each URL
a content contains the raw content retrieved from each URL
a parse_text contains the parsed text of each URL
a parse_data contains outlinks and metadata parsed from each URL
a crawl_parse contains the outlink URLs, used to update the crawldb
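As a quick way to see this layout on disk, here is a minimal sketch that checks each segment for the six subdirectories listed above. The path "crawler/segments" is taken from the question; the function names are hypothetical and this is not part of Nutch itself.

```python
from pathlib import Path

# Subdirectory names as listed in the answer above.
SEGMENT_PARTS = [
    "crawl_generate", "crawl_fetch", "content",
    "parse_text", "parse_data", "crawl_parse",
]

def missing_segment_parts(segment_dir):
    """Return the expected subdirectories that are absent from one segment."""
    segment = Path(segment_dir)
    return [name for name in SEGMENT_PARTS if not (segment / name).is_dir()]

def report(segments_root):
    """Map each segment directory name to its list of missing parts."""
    root = Path(segments_root)
    return {
        seg.name: missing_segment_parts(seg)
        for seg in sorted(root.iterdir()) if seg.is_dir()
    }

if __name__ == "__main__":
    # "crawler/segments" is the path from the question; adjust as needed.
    root = Path("crawler/segments")
    if root.is_dir():
        for name, missing in report(root).items():
            print(name, "complete" if not missing else f"missing: {missing}")
```

A segment that has been generated but not yet fetched or parsed would be expected to show the later parts as missing, so an incomplete report does not necessarily mean something is broken.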

The index folder (data/index) contains the index that Solr builds from the crawled content and the linkdb.

spellchecker: this is the spell-check index, generated to improve queries. This and this are worth reading if you want to learn more about it. Also see this.

For question #2: Where does "PageRank" (or "LinkRank") come into play?

Read this and this. I am not sure whether this and this will be helpful, but they will add to your knowledge.

For question #3: Does Nutch index the pages, and does Solr then index them again?

The indexes for the crawled data are generated by Apache Solr, not by Nutch.

This is how it works internally: Nutch passes all the data collected during parsing to the IndexingFilter extension, which generates the data to be indexed. The output of the filter is a NutchDocument, which is handed back to Nutch. Nutch then decides whether the data should be indexed. Nutch also reads the mapping file, which defines which NutchDocument fields are mapped to which SolrDocument fields.
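To make the field-mapping step concrete, here is a minimal sketch of the idea in Python. It is not Nutch's actual code: the XML below is an abbreviated, hypothetical mapping written in the style of Nutch's solrindex-mapping.xml, the function names are made up, and passing unmapped fields through unchanged is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated mapping in the style of solrindex-mapping.xml.
MAPPING_XML = """
<mapping>
  <fields>
    <field dest="content" source="content"/>
    <field dest="title" source="title"/>
    <field dest="id" source="url"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
"""

def load_mapping(xml_text):
    """Build a {source field: destination field} dict from the mapping XML."""
    root = ET.fromstring(xml_text)
    return {f.get("source"): f.get("dest") for f in root.iter("field")}

def to_solr_fields(nutch_doc, mapping):
    """Rename document fields according to the mapping.

    Assumption of this sketch: a field without a mapping keeps its name.
    """
    return {mapping.get(name, name): value for name, value in nutch_doc.items()}

if __name__ == "__main__":
    mapping = load_mapping(MAPPING_XML)
    # A plain dict stands in for a NutchDocument here.
    doc = {"url": "http://example.com/", "title": "Example", "boost": "1.0"}
    print(to_solr_fields(doc, mapping))
```

The point is only that the crawled document is renamed field by field before it is sent to Solr; the index itself is still built on the Solr side.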
