user2880734

Reputation: 53

Viewing data in nutch crawl/segment folder

I have a question about viewing the data in the crawldb and segments folders. I see there is a content/part-00000 folder inside the segment folder. How do I dump (or view) that data?

This is what I see when I type Esc :%!xxd in the binary file (I removed the hex codes):

SEQ.org.apache.hadoop.io.Text 
org.apache.nutch.parse.ParseText.
.org.apache.hadoop.io.compress. 
DefaultCodec http://localhost:8001/a.html 

and more characters like this.

It does not make much sense. This does not look like the data I have on the local page. Is there another way of looking at this or should I be looking at a different place?

Upvotes: 1

Views: 1049

Answers (1)

aalbahem

Reputation: 782

Run the following command from the Nutch home directory:

bin/nutch readseg -dump crawl/segments/your_segment output -nofetch -noparse -noparsetext
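Each -no... flag tells readseg to leave one part of the segment out of the dump, so the command above skips the fetch, parse and parsetext parts. If you want to try other combinations, here is a minimal Python sketch that builds the same command line from the parts you want to keep. The helper name build_readseg_dump and the SEGMENT_PARTS list are my own (hypothetical) names, not part of Nutch, and the sketch only prints the command rather than running it. As far as I recall, readseg writes a plain-text file named dump inside the output directory, but treat that file name as an assumption and check your output folder.

```python
# Parts of a segment that readseg can skip with a "-no<part>" flag.
SEGMENT_PARTS = ["content", "fetch", "generate", "parse", "parsedata", "parsetext"]


def build_readseg_dump(segment_dir, output_dir, keep):
    """Return the readseg -dump command as a list of arguments.

    `keep` is the set of segment parts to include in the dump;
    every other part gets a matching "-no<part>" flag.
    (Hypothetical helper, not something Nutch ships.)
    """
    unknown = set(keep) - set(SEGMENT_PARTS)
    if unknown:
        raise ValueError(f"unknown segment parts: {sorted(unknown)}")
    cmd = ["bin/nutch", "readseg", "-dump", segment_dir, output_dir]
    cmd += [f"-no{part}" for part in SEGMENT_PARTS if part not in keep]
    return cmd


if __name__ == "__main__":
    # Same selection as the command above: everything except
    # fetch, parse and parsetext.
    cmd = build_readseg_dump(
        "crawl/segments/your_segment",
        "output",
        keep={"content", "generate", "parsedata"},
    )
    print(" ".join(cmd))
    # To actually run it from the Nutch home directory:
    #   import subprocess; subprocess.run(cmd, check=True)
```

For example, passing keep={"content"} would add all five other -no... flags, so the dump should contain only the raw fetched content.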

To see which commands are available in Nutch, run it without any arguments:

bin/nutch

I hope that helps.
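As for why the raw file looks like noise: the files under a segment are Hadoop SequenceFiles, and the text you saw (SEQ, the Text and ParseText class names, DefaultCodec) is the file header. The records after it are compressed, which is why they are unreadable in an editor. Below is a rough Python sketch that reads just that header. It assumes the header layout used by SequenceFile version 5 and later and stops before the metadata and sync marker; the function names are mine, and the sample bytes are built in memory to mirror the header in the question rather than read from a real segment.

```python
import io
import struct


def _read_vint(stream):
    """Decode a Hadoop variable-length integer (as used by Text)."""
    first = struct.unpack("b", stream.read(1))[0]
    if first >= -112:
        return first  # small values fit in a single byte
    negative = first < -120
    total_len = (-119 - first) if negative else (-111 - first)
    value = int.from_bytes(stream.read(total_len - 1), "big")
    return ~value if negative else value


def _read_text(stream):
    """Read a length-prefixed UTF-8 string."""
    length = _read_vint(stream)
    return stream.read(length).decode("utf-8")


def read_sequencefile_header(stream):
    """Return the leading header fields of a SequenceFile as a dict."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a Hadoop SequenceFile")
    version = stream.read(1)[0]
    if version < 5:
        raise ValueError("this sketch only handles version 5 and later")
    header = {
        "version": version,
        "key_class": _read_text(stream),
        "value_class": _read_text(stream),
        "compressed": stream.read(1)[0] != 0,
        "block_compressed": stream.read(1)[0] != 0,
        "codec": None,
    }
    if header["compressed"]:
        header["codec"] = _read_text(stream)
    return header


if __name__ == "__main__":
    def text(s):
        # Single-byte length prefix is enough for these short names.
        raw = s.encode("utf-8")
        return bytes([len(raw)]) + raw

    sample = (
        b"SEQ" + bytes([6])
        + text("org.apache.hadoop.io.Text")
        + text("org.apache.nutch.parse.ParseText")
        + b"\x01\x00"
        + text("org.apache.hadoop.io.compress.DefaultCodec")
    )
    print(read_sequencefile_header(io.BytesIO(sample)))
```

To try it on a real file, open it in binary mode, for example open(path, "rb"), and pass the file object in. This only tells you which key and value classes the file holds; to see the actual records, readseg is still the simpler route.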

Upvotes: 1
