user3523860

Reputation: 25

Unable to verify crawled data stored in hbase

I have crawled a website using Nutch with HBase as the storage back-end, following this tutorial: http://wiki.apache.org/nutch/Nutch2Tutorial.

The Nutch version is 2.2.1, the HBase version is 0.90.4, and the Solr version is 4.7.1.

Here are the steps I used:

./runtime/local/bin/nutch inject urls

./runtime/local/bin/nutch generate -topN 100 -adddays 30

./runtime/local/bin/nutch fetch -all

./runtime/local/bin/nutch fetch -all

./runtime/local/bin/nutch updatedb

./runtime/local/bin/nutch solrindex http://localhost:8983/solr/ -all

My url/seed.txt file contains: http://www.xyzshoppingsite.com/mobiles/

I have kept only the line below in the 'regex-urlfilter.txt' file (all other regexes are commented out).

+^http://([a-z0-9]*\.)*xyzshoppingsite.com/mobile/*
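A quick way to sanity-check this filter line outside Nutch is to run the pattern against a few candidate URLs. This is only a rough sketch: it uses `grep -E` as a stand-in for the Java regex engine Nutch uses, the helper name `filter_urls` is hypothetical, and the leading `+` is dropped because it only marks the rule as an include rule.

```shell
#!/bin/sh
# Pattern copied from regex-urlfilter.txt, without the leading '+'
# (the '+' marks an include rule and is not part of the regex).
PATTERN='^http://([a-z0-9]*\.)*xyzshoppingsite.com/mobile/*'

# Hypothetical helper: reads URLs on stdin, prints those the pattern accepts.
# grep -E only approximates Java regex behaviour, so treat this as a rough check.
filter_urls() {
  grep -E "$PATTERN"
}

# Example usage:
#   printf '%s\n' 'http://www.xyzshoppingsite.com/mobiles/' | filter_urls
```

Because the pattern is not anchored at the end, `mobile/*` also accepts paths that merely start with `/mobile`, such as `/mobiles/`.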

At the end of the crawl, I can see a table "webpage" created in HBase, but I am unable to verify whether all of the data has been crawled completely. A search in Solr returns 0 results.

My ultimate goal is to get the complete data from all pages under mobile in the above URL.

Could you please let me know why the data is crawled like this, rather than the actual contents of the page after redirection?

Please help. Thanks in advance.

Upvotes: 1

Views: 317

Answers (1)

sreemanth pulagam

Reputation: 953

Instead of executing all those steps, can you try the command below?

./bin/crawl url/seed.txt shoppingcrawl http://localhost:8080/solr 2

If it executes successfully, a table named shoppingcrawl_webpage will be created in HBase.

We can check this by executing the command below in the HBase shell:

hbase> list

Then we can scan the specific table. In this case:

 hbase> scan 'shoppingcrawl_webpage'
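A full scan can be hard to read on a large table, so a row count plus a limited scan may be easier to check. The sketch below only prints HBase shell commands. The helper name `hbase_verify_commands` is hypothetical, the table name assumes the crawl ID `shoppingcrawl` used above, and piping the output into `hbase shell` is an assumption about your installation.

```shell
#!/bin/sh
# Hypothetical helper: prints HBase shell commands that count the rows of a
# table and scan only its first five rows. It does not contact HBase itself.
hbase_verify_commands() {
  table="$1"
  printf "count '%s'\n" "$table"
  printf "scan '%s', {LIMIT => 5}\n" "$table"
}

# Example usage (assumes the hbase launcher is on PATH and accepts piped input):
#   hbase_verify_commands shoppingcrawl_webpage | hbase shell
```

A count of zero, or rows that contain no fetched content, would suggest the problem lies in the crawl itself rather than in the Solr indexing step.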

Upvotes: 0
