Reputation: 833
Hi all,
I'm wondering how Nutch works with a Hadoop cluster. How does it split a job across the other nodes? How does it ensure that different nodes in the cluster won't request the same URL?
Thanks in advance.
Upvotes: 4
Views: 1375
Reputation: 6169
The phases of Nutch are: Inject -> Generate -> Fetch -> Parse -> Update -> Index
Of these, the Fetch phase is where Nutch sends requests for the URLs, so I will only talk about this phase and the Generate phase in this answer.
The Generate phase creates fetch lists from the URLs in the crawldb. While the fetch lists are being created, URLs belonging to the same host typically fall into the same partition, because the partitioning function is based on the hostname. So the final fetch lists will look like this:
fetch list 1 : all urls of host a1, b1, c1
fetch list 2 : all urls of host a2, b2, c2
.............
.............
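To make the partitioning idea concrete, here is a minimal sketch in Python. It is not Nutch's actual partitioner: the function names, the sample URLs and the use of `zlib.crc32` as the hash are all assumptions chosen for illustration. The only point it shows is that when the partition is computed from the hostname alone, every URL of a host ends up in the same fetch list.

```python
from collections import defaultdict
from urllib.parse import urlparse
import zlib


def partition_for(url: str, num_partitions: int) -> int:
    """Pick a partition from the URL's hostname only (hypothetical helper)."""
    host = urlparse(url).hostname or ""
    # crc32 is used only because it is stable across runs;
    # the hash Nutch really uses is not reproduced here.
    return zlib.crc32(host.encode("utf-8")) % num_partitions


def build_fetch_lists(urls, num_partitions):
    """Group URLs into fetch lists, one list per partition."""
    fetch_lists = defaultdict(list)
    for url in urls:
        fetch_lists[partition_for(url, num_partitions)].append(url)
    return dict(fetch_lists)


# Made-up sample URLs: three hosts, several pages each.
urls = [
    "http://a.example.com/1",
    "http://b.example.org/1",
    "http://a.example.com/2",
    "http://c.example.net/1",
    "http://b.example.org/2",
    "http://a.example.com/3",
]

fetch_lists = build_fetch_lists(urls, num_partitions=3)

# For every host, collect the partitions its URLs landed in.
partitions_per_host = defaultdict(set)
for partition, members in fetch_lists.items():
    for url in members:
        partitions_per_host[urlparse(url).hostname].add(partition)
```

In `partitions_per_host` every host maps to exactly one partition, which is the property the fetch phase relies on. Several hosts may still share a partition.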
Now, when the Fetch phase reads these fetch lists, each fetch list is assigned to, and processed by, a single mapper of the Fetch phase. So,
number of reducers in generate partition phase
= the number of fetchlists created
= number of maps in fetch phase
If a mapper in the Fetch phase gets the URLs of host A, no other mapper will have URLs of that host. Of course, each mapper can have URLs of multiple hosts, but no other mapper will have URLs from those hosts.
Now, digging deeper into the mapper of the Fetch phase:
It will have URLs of, say, n hosts h1, h2, ..., hn. Fetch queues are then formed on a per-host basis, and all URLs (fetch items) are placed in the fetch queue of their respective host. The fetcher threads poll the fetch queues, pick up URLs from there, send the requests and write the results back to HDFS. Once that is done, they look for other fetch items (URLs) that can be processed.
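The per-host queues and polling threads can be sketched roughly as follows. This is a simplified Python illustration, not the logic of Fetcher.java: `fake_fetch`, `run_fetcher` and the sample URLs are hypothetical, nothing is written to HDFS, and real-world details such as per-host politeness delays are left out.

```python
import threading
from collections import defaultdict, deque
from urllib.parse import urlparse


def build_fetch_queues(urls):
    """One queue per host; each URL goes into the queue of its host."""
    queues = defaultdict(deque)
    for url in urls:
        queues[urlparse(url).hostname].append(url)
    return queues


def fake_fetch(url):
    """Stand-in for the real HTTP request (assumption, no network access)."""
    return f"content of {url}"


def run_fetcher(urls, num_threads=4):
    queues = build_fetch_queues(urls)
    lock = threading.Lock()
    results = {}  # stand-in for the output written back to HDFS

    def next_item():
        # Poll the queues and take the first available fetch item.
        with lock:
            for host_queue in queues.values():
                if host_queue:
                    return host_queue.popleft()
        return None

    def worker():
        while True:
            url = next_item()
            if url is None:
                return  # nothing left to fetch
            content = fake_fetch(url)
            with lock:
                results[url] = content

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


sample_urls = [
    "http://h1.example.com/a",
    "http://h1.example.com/b",
    "http://h2.example.com/a",
    "http://h3.example.com/a",
    "http://h3.example.com/b",
]
fetched = run_fetcher(sample_urls, num_threads=3)
```

Because each URL is removed from its queue under a lock before it is fetched, no two threads in the same mapper pick up the same URL, and `fetched` ends up with one entry per input URL.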
I hope I managed to put this in an understandable way. For more details, see the Fetcher.java code.
Note: The URLs can be grouped on the basis of IP too. You can even tweak Nutch so that it does not group URLs by hostname/IP at all. Both of these depend on your configuration. By default it uses the hostname for grouping URLs.
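As a rough illustration of what switching the grouping means, here is a small Python sketch. The function `grouping_key`, the mode names `"host"` and `"ip"`, and the fake DNS table are all assumptions made up for this example. In Nutch itself the choice is made through configuration; as far as I recall, 1.x releases expose it via properties such as `partition.url.mode` and `fetcher.queue.mode`, but check the nutch-default.xml of your version.

```python
from urllib.parse import urlparse

# Made-up DNS table: two hostnames served from the same IP address.
FAKE_DNS = {
    "www.example.com": "192.0.2.10",
    "blog.example.com": "192.0.2.10",
    "other.example.org": "192.0.2.20",
}


def grouping_key(url, mode="host", resolve=None):
    """Return the key used to group a URL (hypothetical helper)."""
    host = urlparse(url).hostname or ""
    if mode == "host":
        return host
    if mode == "ip":
        if resolve is None:
            raise ValueError("mode 'ip' needs a resolver function")
        return resolve(host)
    raise ValueError(f"unknown mode: {mode}")


by_host = {
    grouping_key(u, "host")
    for u in ("http://www.example.com/x", "http://blog.example.com/y")
}
by_ip = {
    grouping_key(u, "ip", FAKE_DNS.get)
    for u in ("http://www.example.com/x", "http://blog.example.com/y")
}
```

With hostname grouping the two URLs fall into two different groups, while with IP grouping they fall into one, so they would be handled by the same mapper and the same queue.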
Upvotes: 6