Deep Lotia

Reputation: 41

Nutch: Crawl Broken Links & Index Them in Solr

My goal is to find out how many URLs in an HTML page are invalid (404, 500, host not found). Is there a configuration change in Nutch that makes the crawler follow broken links and index them in Solr?

Once both the broken and the valid links are indexed in Solr, I can check which URLs are invalid and remove them from my HTML page.

Any help will be highly appreciated.

Thanks in advance.

Upvotes: 1

Views: 1403

Answers (2)

Samir

Reputation: 38

This command will give you a dump of just the broken links:

bin/nutch readdb <crawlFolder>/crawldb/ -dump myDump -status db_gone

Remember to exclude URLs carrying the following metadata tag from the dump, since those entries come from respecting robots.txt rather than from genuinely broken pages:

Metadata: _pst_=robots_denied(18)
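If you want to strip those entries out automatically, one option is a small post-processing step over the dump; a rough sketch, assuming the dump is written as blank-line-separated records under myDump/part-* (the part-xxxxx files mentioned in the other answer):

# Rough sketch: print the URL of every db_gone record whose metadata
# does not contain the robots_denied tag.
awk -v RS='' '!/robots_denied/ { print $1 }' myDump/part-*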

Upvotes: 0

Diaa

Reputation: 879

You don't need to index into Solr to find the broken links. Run the following:

bin/nutch readdb <crawlFolder>/crawldb/ -dump myDump

It will list the 404 links like this:

Status: 3 (db_gone)
Metadata: _pst_: notfound(14)

Go through the output file and you'll find all the broken links.
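If you'd rather not scan the file by hand, something like the following should pull out just the db_gone URLs; a rough sketch, assuming the dump sits in myDump/part-* and uses the blank-line-separated record layout shown in the results below:

# Rough sketch: collect the URL of every record whose status is 3 (db_gone).
awk -v RS='' '/Status: 3 \(db_gone\)/ { print $1 }' myDump/part-* > broken_links.txt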

Example:

  1. Put "http://www.wikipedia.com/somethingUnreal http://en.wikipedia.org/wiki/NocontentPage" in the url file
  2. Run the crawl command: bin/nutch crawl urls.txt -depth 1
  3. Run the readdb command: bin/nutch readdb crawl-20140214115539/crawldb/ -dump mydump
  4. Open the output file "part-xxxxx" with a text editor

Results:

http://en.wikipedia.org/wiki/NocontentPage  Version: 7
Status: 1 (db_unfetched)
...
Metadata: _pst_: exception(16), lastModified=0: Http code=503, url=http://en.wikipedia.org/wiki/NocontentPage

http://www.wikipedia.com/somethingUnreal    Version: 7
Status: 5 (db_redir_perm)
...
Metadata: Content-Type: text/html_pst_: moved(12), lastModified=0: http://www.wikipedia.org/somethingUnreal
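Note that in this particular run neither record is actually db_gone: the 503 surfaces as an exception on a db_unfetched record, and the typo domain as a permanent redirect. If you want to catch those cases as well, a broader filter over the dump is one option; a rough sketch, matching the markers visible in the sample output above:

# Rough sketch: also flag records whose metadata reports notfound, an
# exception, or a redirect, in addition to db_gone.
awk -v RS='' '/db_gone|notfound|exception|moved/ { print $1 }' mydump/part-*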

Upvotes: 1
