Reputation: 53
I have a question about viewing the data in the crawldb/segments
folder. I see there is a content/part-00000
folder inside the segment folder. How do I dump (or view) the data?
This is what I see when I type Esc :%!xxd
in the binary file (I removed the hex codes):
SEQ.org.apache.hadoop.io.Text
org.apache.nutch.parse.ParseText.
.org.apache.hadoop.io.compress.
DefaultCodec http://localhost:8001/a.html
and more characters like this.
It does not make much sense. This does not look like the data I have on the local page. Is there another way of looking at this or should I be looking at a different place?
Upvotes: 1
Views: 1049
Reputation: 782
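Run the following command from the Nutch home directory (the -nofetch, -noparse and -noparsetext options leave the crawl_fetch, crawl_parse and parse_text parts of the segment out of the dump):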
bin/nutch readseg -dump crawl/segments/your_segment output -nofetch -noparse -noparsetext
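As background on why the raw file looks the way it does in the question: the segment part files are Hadoop SequenceFiles, so a hex view starts with the magic bytes SEQ followed by the key class, the value class, and (if compressed) the codec class. The records after that header are compressed, which is why the page text is not readable. Below is a minimal Python sketch that reads only that header. It assumes the version 5 or later header layout, and the function names are made up for illustration; it is not part of Nutch or Hadoop.

```python
import io
import struct


def read_vint(stream):
    """Decode one Hadoop variable-length integer (the WritableUtils encoding)."""
    first = struct.unpack("b", stream.read(1))[0]  # signed first byte
    if first >= -112:
        return first  # small values are stored in a single byte
    negative = first < -120
    # The first byte encodes how many payload bytes follow.
    extra = (-119 - first if negative else -111 - first) - 1
    value = int.from_bytes(stream.read(extra), "big")
    return ~value if negative else value


def read_string(stream):
    """Read a length-prefixed UTF-8 string as written by Hadoop's Text."""
    length = read_vint(stream)
    return stream.read(length).decode("utf-8")


def read_sequence_file_header(stream):
    """Parse the start of a SequenceFile header (assumed layout: version >= 5)."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a Hadoop SequenceFile")
    header = {
        "version": stream.read(1)[0],
        "key_class": read_string(stream),
        "value_class": read_string(stream),
    }
    header["compressed"] = stream.read(1) != b"\x00"
    header["block_compressed"] = stream.read(1) != b"\x00"
    # The codec class name is only present when compression is enabled.
    header["codec"] = read_string(stream) if header["compressed"] else None
    return header


# Hypothetical usage (the path is a placeholder; adjust it to your segment):
# with open("crawl/segments/your_segment/content/part-00000/data", "rb") as f:
#     print(read_sequence_file_header(f))
```

This only confirms what kind of data the file holds. To actually read the page content, the readseg dump above is the practical route, since it handles decompression and record decoding for you.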
To see which commands Nutch provides, run the script without arguments:
bin/nutch
I hope that helps.
Upvotes: 1