Reputation: 12730
Commoncrawl datasets are splitted by segments. How to extract a subset of the common-crawl data-set? I need a WARC archive file (or several archive files) with all the files from a given domain, such as example.com?
Note: common_crawl_index allows to do that by running bin/remote_copy copy "com.ipc.www" --bucket commoncrawl_sample --key common_crawl/ipc_crawl
, but the project is outdated: it only works for 2012 datasets, and it does not accept WARC, WAT or WET files.
Note: Also, http://index.commoncrawl.org/ allows to find the segments for a given url prefix, but there is not a utility to download only that pages, such as the previous remote_copy
command.
PS: I am aware I can implement a program to do so. Here I am asking if common-crawl (or someone else) already thought and implemented this feature.
Upvotes: 4
Views: 457