Reputation: 23
Common Crawl is a non-profit third party web search engine. http://commoncrawl.org
I'm seeing the API to search Common Crawl for a given domain.
How can I search common crawl for a given search term?
Upvotes: 1
Views: 1239
Reputation: 4864
you can't currently search the content of the web pages. There was commonsearch which used the CC datasets but I am not sure how up to date it is. If you are looking for a limited set of keywords you could use Mapreduce or Spark to filter the pages but if you are dealing with an open or arbitrary set of queries then the best approach would be to index the datasets into Elasticsearch or SOLR yourself.
Upvotes: 3