Matt07

Reputation: 523

How to make Apache Nutch index while crawling

I started using Apache Nutch (v1.5.1) to index all the websites under certain domains. There is a huge number of websites (on the order of millions) in my domains, and I need to index them step by step instead of waiting for the whole process to finish.

I found something in the Nutch wiki that should work (here: http://wiki.apache.org/nutch/NutchTutorial/#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling). The idea is to make a script which cyclically calls every single step of my process (crawl, fetch, parse, ...) on a certain amount of data (for example 1000 URLs).

bin/nutch inject crawl/crawldb crawl/seed.txt

bin/nutch generate crawl/crawldb crawl/segments -topN 25
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1

bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1

bin/nutch generate crawl/crawldb crawl/segments -topN 25
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2

bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2

...

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
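For reference, the two unrolled rounds above can be wrapped in a loop. This is only a sketch of that idea, not the tutorial's exact script; the round count and the -topN value (here 1000, matching the batch size mentioned above) are placeholders:

# run several generate/fetch/parse/updatedb rounds, then build the linkdb and the index
for i in 1 2 3 4 5; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    s=`ls -d crawl/segments/2* | tail -1`
    echo $s
    bin/nutch fetch $s
    bin/nutch parse $s
    bin/nutch updatedb crawl/crawldb $s
done
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*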

My question is: is there any way to specify this setting directly in Nutch and make it do this work in a parallel and more transparent way? For example, on separate threads?

Thanks for answering.

UPDATE

I tried to create the script (the code is above), but unfortunately I get an error in the invertlinks phase. This is the output:

LinkDb: starting at 2012-07-30 11:04:58
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/home/apache-nutch-1.5-bin/crawl/segments/20120730102927
LinkDb: adding segment: file:/home/apache-nutch-1.5-bin/crawl/segments/20120704094625
...
LinkDb: adding segment: file:/home/apache-nutch-1.5-bin/crawl/segments/20120704095730

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/home/apache-nutch-1.5-bin/crawl/segments/20120730102927/parse_data

Input path does not exist:
file:/home/apache-nutch-1.5-bin/crawl/segments/20120704094625/parse_data
...
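A quick way to see which segments are missing parse_data (assuming, as the error suggests, that some segments were generated but never successfully fetched and parsed) is something like:

for s in crawl/segments/2*; do
    [ -d "$s/parse_data" ] || echo "missing parse_data: $s"
done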

Thanks for your help.

Upvotes: 2

Views: 3889

Answers (1)

Jacob Sanford

Reputation: 134

(If I had enough rep I would post this as a comment).

Remember that the -depth switch refers to EACH CRAWL, and not the overall depth of the site it will crawl. That means that the second run of depth=1 will descend one MORE level from the already indexed data and stop at topN documents.

So, if you aren't in a hurry to fully populate the data, I've had a lot of success in a similar situation by performing a large number of repeated shallow nutch crawl statements (using smallish -depth (3-5) and -topN (100-200) values) from a large seed list. This will ensure that only (depth * topN) pages get indexed in each batch, and the index will start delivering URLs within a few minutes.
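As an illustration only (the seed directory, thread count, and numbers below are placeholders, and this assumes the all-in-one crawl command shipped with Nutch 1.x), a single shallow batch would look something like:

# one shallow batch: at most depth * topN pages
bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 200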

Then, I typically set up the crawl to fire off every (1.5 * average initial crawl time) seconds and let it rip. Understandably, at only 1,000 documents per crawl, it can take a lot of time to get through a large infrastructure, and (with indexing, the pause time, and other overhead) the method can multiply the time needed to crawl the whole stack.
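A minimal sketch of that scheduling idea (the 900-second sleep is a made-up stand-in for roughly 1.5 times your measured average crawl time):

# re-fire the shallow crawl on a fixed interval
while true; do
    bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 200
    sleep 900   # placeholder: tune to ~1.5x your average crawl time
done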

The first few times through the infrastructure, it's a pretty bad slog. As the adaptive crawling algorithm starts to kick in, however, and the recrawl times start to approach reasonable values, the package starts really delivering.

(This is somewhat similar to the "whole web crawling" method you mention in the Nutch wiki, which advises you to break the data into 1,000-page segments, but this approach is much more terse and understandable for a beginner.)

Upvotes: 3
