Reputation: 25
I am running Nutch in local mode on a server with 64 GB RAM and 32 processors. I have one URL in the seed list and the following configuration in nutch-site.xml:
fetcher.threads.fetch=16
fetcher.threads.per.queue=2
fetcher.max.crawl.delay=120
fetcher.queue.depth.multiplier=150
fetcher.queue.mode=byHost
How many requests will be made to the URL in the fetch phase if -topN is set to 1000? Will multiple map tasks be created for the Fetcher? My understanding is that a single map task is created irrespective of the number of URLs that need to be fetched from the fetch list. I tried googling the relation between fetcher.threads.fetch and fetcher.threads.per.queue but didn't find anything that was clear. I am also adding logs from the fetcher phase:
FetcherThread INFO fetcher.FetcherThread (277) - fetching http://investors.te.com/news-releases/press-release-details/2018/TE-Connectivity-announces-fourth-quarter-and-full-year-results-for-fiscal-year-2018/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching http://investors.te.com/shareholder-info/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/news-releases/press-release-details/2019/TE-Connectivity-to-hold-annual-general-meeting-of-shareholders-on-March-13-2019/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/request-information/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/email-alerts/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/site-map/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/rss/PressRelease.aspx?LanguageId=1&CategoryWorkflowId=00000000-0000-0000-0000-000000000000&tags= (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/stock-information/quote-and-chart/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/overview/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/investor-contacts/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/js/mobileRedirect.js (queue crawl delay=10000ms)
Upvotes: 1
Views: 452
Reputation: 2239
There will only be a single request because there is only one URL. If there are two URLs from a single host, then with fetcher.threads.per.queue=2 there can be two simultaneous requests to the same host. A high value for fetcher.threads.fetch only makes sense if you have a large number of hosts to crawl, or if you're crawling your own fast and responsive local web server. In the latter case fetcher.threads.per.queue should be equal or close to fetcher.threads.fetch. If it's not your own server and you're not explicitly allowed to do so, you should always keep the default for fetcher.threads.per.queue, which is a single thread (=1) with no parallel connections to the same host and a guaranteed delay between successive requests.
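To make the relation between the two settings concrete, here is a rough Python sketch of an upper bound on simultaneous requests. It is a simplified model and not Nutch's actual scheduling code: the function name max_parallel_requests is hypothetical, and the sketch assumes one queue per host name (as with fetcher.queue.mode=byHost), whereas the real fetcher's queue key may also take the protocol into account. Crawl delays are ignored entirely.

```python
from collections import Counter
from urllib.parse import urlsplit


def max_parallel_requests(urls, threads_fetch, threads_per_queue):
    """Upper bound on simultaneous requests under a simplified byHost model.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    # Assumption: one fetch queue per host name.
    queue_sizes = Counter(urlsplit(url).hostname for url in urls)
    # A queue can keep at most threads_per_queue requests in flight,
    # and never more than the number of URLs it holds.
    per_queue_total = sum(
        min(size, threads_per_queue) for size in queue_sizes.values()
    )
    # The global thread pool (fetcher.threads.fetch) caps the total.
    return min(threads_fetch, per_queue_total)
```

Under this model, a fetch list with a single URL never yields more than one request at a time, a fetch list with many URLs from one host is limited by fetcher.threads.per.queue, and the 16 fetcher threads are only fully used once enough distinct hosts are present.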
Upvotes: 1