Get a WARC archive file with all files from a given domain, using commoncrawl.org

Question

Common Crawl datasets are split into segments. How can I extract a subset of the Common Crawl dataset? I need a WARC archive file (or several archive files) containing all the files from a given domain, such as example.com.

Note: common_crawl_index can do this by running bin/remote_copy copy "com.ipc.www" --bucket commoncrawl_sample --key common_crawl/ipc_crawl, but the project is outdated: it only works with the 2012 datasets, and it does not support WARC, WAT or WET files.

Note: http://index.commoncrawl.org/ also makes it possible to find the segments for a given URL prefix, but there is no utility to download only those pages, as the remote_copy command above does.

PS: I am aware that I could write a program to do this. I am asking whether Common Crawl (or someone else) has already thought of and implemented this feature.
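For reference, the kind of program I have in mind would look roughly like the sketch below. It is only an illustration under several assumptions: that the index server answers queries of the form <collection>-index?url=<domain>/*&output=json with one JSON object per line carrying filename, offset and length fields, that the WARC files are served over HTTP from data.commoncrawl.org, and that CC-MAIN-2015-06 is just an example collection name. The helper names are made up for this sketch. It ignores paging of the index results, so it would only suit small domains.

```python
import json
import urllib.parse
import urllib.request

INDEX_HOST = "http://index.commoncrawl.org"  # index server mentioned above
DATA_HOST = "https://data.commoncrawl.org"   # assumption: host serving the WARC files


def index_query_url(collection, domain):
    """Build an index query matching every capture under the domain."""
    query = urllib.parse.urlencode({"url": domain + "/*", "output": "json"})
    return "{}/{}-index?{}".format(INDEX_HOST, collection, query)


def parse_index_lines(text):
    """Parse the index response: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def range_header(record):
    """HTTP Range value covering exactly one gzipped WARC record."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # Range ends are inclusive
    return "bytes={}-{}".format(start, end)


def copy_domain(collection, domain, out_path):
    """Write every matching record to out_path; return the record count."""
    with urllib.request.urlopen(index_query_url(collection, domain)) as resp:
        records = parse_index_lines(resp.read().decode("utf-8"))
    with open(out_path, "wb") as out:
        for record in records:
            request = urllib.request.Request(
                "{}/{}".format(DATA_HOST, record["filename"]),
                headers={"Range": range_header(record)},
            )
            # Each record is its own gzip member, so the concatenated
            # pieces should still form a readable .warc.gz file.
            with urllib.request.urlopen(request) as part:
                out.write(part.read())
    return len(records)
```

Calling copy_domain("CC-MAIN-2015-06", "example.com", "example.warc.gz") would then write a single archive for the domain, one HTTP request per record.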

Answers (0)
