Unable to verify crawled data stored in hbase

Question

I have crawled a website using Nutch with HBase as the storage back-end, following this tutorial: http://wiki.apache.org/nutch/Nutch2Tutorial.

The Nutch version is 2.2.1, the HBase version is 0.90.4, and the Solr version is 4.7.1.

Here are the steps I used:

./runtime/local/bin/nutch inject urls

./runtime/local/bin/nutch generate -topN 100 -adddays 30

./runtime/local/bin/nutch fetch -all

./runtime/local/bin/nutch updatedb

./runtime/local/bin/nutch solrindex http://localhost:8983/solr/ -all

My url/seed.txt file contains: http://www.xyzshoppingsite.com/mobiles/

I have kept only the line below in the 'regex-urlfilter.txt' file (all other regexes are commented out).

+^http://([a-z0-9]*\.)*xyzshoppingsite.com/mobile/*
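As a sanity check on the filter rule above, here is a minimal sketch in Python of which URLs it would accept. It assumes that Nutch applies each rule as a partial (find-style) match rather than a full match, and that Python's `re` behaves like Java's regex engine for this pattern. The helper name `accepts` and every test URL other than the seed are hypothetical.

```python
import re

# The single active rule from regex-urlfilter.txt, without the leading '+'.
RULE = r"^http://([a-z0-9]*\.)*xyzshoppingsite.com/mobile/*"


def accepts(url: str) -> bool:
    """Return True if the rule matches a prefix of the URL.

    Assumption: the filter uses partial (find-style) matching, so any
    text may follow the part of the URL that the rule matched.
    """
    return re.search(RULE, url) is not None


if __name__ == "__main__":
    candidates = [
        "http://www.xyzshoppingsite.com/mobiles/",   # the seed URL
        "http://www.xyzshoppingsite.com/mobile/",    # hypothetical
        "http://www.xyzshoppingsite.com/laptops/",   # hypothetical
        "https://www.xyzshoppingsite.com/mobiles/",  # hypothetical, https
    ]
    for url in candidates:
        print(url, accepts(url))
```

Under that assumption, the rule accepts the seed URL even though it says mobile rather than mobiles, because the trailing `/*` allows zero slashes. It rejects any URL that starts with https://.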

At the end of the crawl, I can see that a table named "webpage" has been created in HBase, but I am unable to verify whether all of the data has been crawled completely. A search in Solr returns 0 results.

My ultimate goal is to get the complete data from all pages under mobile at the URL above.

Could you please let me know:

How can I verify the crawled data present in HBase?
The Solr log directory contains 0 files, so I have nothing to debug with. How can I resolve this?
The output of the HBase command scan "webpage" shows only timestamp data, plus other data such as

'value=\x0A\x0APlease Wait ... Redirecting to http://www.xyzshoppingsite.com/mobilesPlease Wait ... Redirecting to http://www.xyzshoppingsite.com/mobiles'

Why is the data crawled like this, rather than the actual contents of the page after the redirect?

Please help. Thanks in advance.

Thanks and Regards!
