sushmita

Reputation: 25

Relation between fetcher.server.min.delay and fetcher.threads.fetch in Nutch 1.13

I am running Nutch in local mode on a server with 64 GB of RAM and 32 processors. I have one URL in the seed list and the following configuration in nutch-site.xml:

fetcher.threads.fetch=16
fetcher.threads.per.queue=2
fetcher.max.crawl.delay=120
fetcher.queue.depth.multiplier=150
fetcher.queue.mode=byHost

How many requests will be made to the URL in the fetch phase if -topN is set to 1000? Will multiple map tasks be created for the Fetcher? My understanding is that a single map task is created regardless of the number of URLs to be fetched from the fetch list. I tried googling the relation between fetcher.threads.fetch and fetcher.threads.per.queue but didn't find anything clear. I am also adding logs from the fetcher phase:

FetcherThread INFO  fetcher.FetcherThread (277) - fetching http://investors.te.com/news-releases/press-release-details/2018/TE-Connectivity-announces-fourth-quarter-and-full-year-results-for-fiscal-year-2018/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO  fetcher.FetcherThread (277) - fetching http://investors.te.com/shareholder-info/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO  fetcher.FetcherThread (277) - fetching https://investors.te.com/news-releases/press-release-details/2019/TE-Connectivity-to-hold-annual-general-meeting-of-shareholders-on-March-13-2019/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO  fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/request-information/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO  fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/email-alerts/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO  fetcher.FetcherThread (277) - fetching https://investors.te.com/site-map/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO  fetcher.FetcherThread (277) - fetching https://investors.te.com/rss/PressRelease.aspx?LanguageId=1&CategoryWorkflowId=00000000-0000-0000-0000-000000000000&tags= (queue crawl delay=10000ms)
FetcherThread INFO  fetcher.FetcherThread (277) - fetching https://investors.te.com/stock-information/quote-and-chart/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO  fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/overview/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO  fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/investor-contacts/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO  fetcher.FetcherThread (277) - fetching https://investors.te.com/js/mobileRedirect.js (queue crawl delay=10000ms)

Upvotes: 1

Views: 452

Answers (1)

Sebastian Nagel

Reputation: 2239

There will be only a single request because there is only one URL. If there are two URLs from a single host, then with fetcher.threads.per.queue=2 there can be two simultaneous requests to that host. A high value for fetcher.threads.fetch only makes sense if you have a large number of hosts to crawl, or if you're crawling your own fast and responsive local web server. In the latter case, fetcher.threads.per.queue should be equal or close to fetcher.threads.fetch. If it's not your own server and you haven't been explicitly allowed to do otherwise, you should always keep the default for fetcher.threads.per.queue, which is a single thread (=1): no parallel connections to the same host and a guaranteed delay between successive requests.
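To make the arithmetic concrete, here is a rough Python sketch of the upper bound on simultaneous requests. It is a simplified model of the byHost queueing described above, not Nutch's actual code: the function name max_parallel_requests is made up for illustration, and crawl delays, robots.txt rules, and timeouts are ignored.

```python
from collections import Counter
from urllib.parse import urlsplit


def max_parallel_requests(urls, threads_fetch, threads_per_queue):
    """Upper bound on simultaneous requests in a simplified byHost model.

    Hypothetical helper for illustration only; it is not part of Nutch.
    """
    # One queue per host, mirroring fetcher.queue.mode=byHost.
    urls_per_host = Counter(urlsplit(u).hostname for u in urls)
    # A queue is served by at most threads_per_queue threads,
    # and never by more threads than it has URLs.
    per_host = {h: min(n, threads_per_queue) for h, n in urls_per_host.items()}
    # The global thread pool (fetcher.threads.fetch) caps the total.
    total = min(threads_fetch, sum(per_host.values()))
    return total, per_host


one_url = ["http://investors.te.com/shareholder-info/default.aspx"]
two_urls = one_url + ["http://investors.te.com/site-map/default.aspx"]

print(max_parallel_requests(one_url, 16, 2)[0])   # one URL, one request
print(max_parallel_requests(two_urls, 16, 2)[0])  # two URLs, same host
print(max_parallel_requests(two_urls, 16, 1)[0])  # default of one thread per queue
```

With a single host, the other 14 or 15 of the 16 fetcher threads have nothing to do in this model, which is why a high thread count only pays off across many hosts.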

Upvotes: 1
