Naveen

Reputation: 455

Nutch crawl stopped after parsing one page

While crawling with Nutch, only one page is parsed and the crawl does not move forward. Can anyone please help? The Nutch output is below.

After parsing the first page it stops and does not go any deeper; the crawl never proceeds past that page.

[Naveen@01hw5189 apache-nutch-1.7]$ bin/nutch crawl urls -dir crawlwiki -depth 10 -topN 10
solrUrl is not set, indexing will be skipped...
crawl started in: crawlwiki
rootUrlDir = urls
threads = 10
depth = 10
solrUrl=null
topN = 10
Injector: starting at 2013-09-12 15:51:45
Injector: crawlDb: crawlwiki/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-12 15:51:47, elapsed: 00:00:02
Generator: starting at 2013-09-12 15:51:47
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawlwiki/segments/20130912155149
Generator: finished at 2013-09-12 15:51:50, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-12 15:51:50
Fetcher: segment: crawlwiki/segments/20130912155149
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
fetching http://en.wikipedia.org/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-09-12 15:51:53, elapsed: 00:00:03
ParseSegment: starting at 2013-09-12 15:51:53
ParseSegment: segment: crawlwiki/segments/20130912155149
ParseSegment: finished at 2013-09-12 15:51:54, elapsed: 00:00:01
CrawlDb update: starting at 2013-09-12 15:51:54
CrawlDb update: db: crawlwiki/crawldb
CrawlDb update: segments: [crawlwiki/segments/20130912155149]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2013-09-12 15:51:56, elapsed: 00:00:02
Generator: starting at 2013-09-12 15:51:56
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawlwiki/segments/20130912155159
Generator: finished at 2013-09-12 15:52:00, elapsed: 00:00:04
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-12 15:52:00
Fetcher: segment: crawlwiki/segments/20130912155159
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
fetching http://en.wikipedia.org/wiki/Main_Page (queue crawl delay=5000ms)
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-09-12 15:52:02, elapsed: 00:00:02
ParseSegment: starting at 2013-09-12 15:52:02
ParseSegment: segment: crawlwiki/segments/20130912155159
Parsed (8ms):http://en.wikipedia.org/wiki/Main_Page
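For reference, one way to check whether the second segment actually fetched anything and whether new URLs made it into the crawldb is to dump the crawldb statistics and the segment contents. This is only a sketch, using the directory names from the output above and the standard Nutch 1.x reader commands:

# show crawldb status counts (db_unfetched, db_fetched, etc.)
bin/nutch readdb crawlwiki/crawldb -stats

# dump the second segment into a local directory for inspection
bin/nutch readseg -dump crawlwiki/segments/20130912155159 segdump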

Upvotes: 2

Views: 1061

Answers (1)

andrew.butkus

Reputation: 777

Check the robots.txt file for Wikipedia at

http://en.wikipedia.org/robots.txt

The robots.txt may be denying the deeper crawl. The robots file defines which parts of a site web crawlers are allowed to access, and Nutch complies with this 'netiquette'.
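As a quick check (just a sketch, assuming curl is available on the crawling machine), download the file and look at the rules that apply to your crawler's user agent, i.e. the value you set for 'http.agent.name' (the Fetcher warning in the output above also mentions listing that value first in 'http.robots.agents'):

# fetch and page through Wikipedia's robots.txt
curl -s http://en.wikipedia.org/robots.txt | less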

Hope that helps.

Upvotes: 1
