Reputation: 301
I have crawled a list of websites using Nutch 1.12. I can dump the crawl data into separate HTML files by using:
./bin/nutch dump -segment crawl/segments/ -o outputDir nameOfDir
And into a single WARC file by using:
./bin/nutch warc crawl/warcs crawl/segments/nameOfSegment
But how can I dump the collected data into multiple WARC files, one for each webpage crawled?
Upvotes: 2
Views: 329
Reputation: 301
After quite a few attempts, I found that
./bin/nutch commoncrawldump -outputDir nameOfOutputDir -segment crawl/segments/segmentDir -warc
does exactly what I needed: a full dump of the segment into individual WARC files!
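As a quick sanity check of the result (the output path matches the command above; the file-name pattern is an assumption, adjust to whatever the tool actually produces):

find nameOfOutputDir -type f -name '*.warc*' | wc -l   # expect roughly one file per crawled page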
Upvotes: 1
Reputation: 4864
Sounds a bit wasteful to have one WARC per doc, but here you go: you could specify a low value for 'warc.output.segment.size' so that the files get rotated every time a new document is written. WarcExporter uses https://github.com/ept/warc-hadoop under the bonnet; the config is applied there.
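For example, a minimal sketch of setting that property in conf/nutch-site.xml, assuming it is picked up from the standard Nutch configuration like any other job property; the 1-byte value is an assumption chosen to force a rotation after every document:

<property>
  <name>warc.output.segment.size</name>
  <value>1</value>
  <description>Maximum WARC file size in bytes before rotating to a new file.
  A tiny value (assumption, per the answer above) means a new file for
  roughly every document written.</description>
</property>

Re-running ./bin/nutch warc crawl/warcs crawl/segments/nameOfSegment with this in place should then emit one WARC file per document.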
Upvotes: 0