Reputation: 111
I am trying to get URLs, titles and languages from web pages. Fortunately, Common Crawl (CC) offers an index API: https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference. Sadly, I could not find a way to get the titles as well.
At the moment I query CC with, for example, http://index.commoncrawl.org/CC-MAIN-2018-47-index?url=www.example.com/*&output=json, which returns the "url" and "languages" fields.
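For reference, here is a minimal sketch of that query in Python. It assumes the server answers with one JSON object per line when `output=json` is set, and that some records may lack a "languages" field. The helper names (`build_query_url`, `parse_index_lines`, `query_index`) are made up for illustration and are not part of any CC library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Index endpoint taken from the example query above.
INDEX_ENDPOINT = "http://index.commoncrawl.org/CC-MAIN-2018-47-index"


def build_query_url(url_pattern, endpoint=INDEX_ENDPOINT):
    """Build the index query URL; the pattern is percent-encoded."""
    return endpoint + "?" + urlencode({"url": url_pattern, "output": "json"})


def parse_index_lines(body):
    """Parse a line-delimited JSON response into (url, languages) pairs.

    Assumption: one JSON object per non-empty line. A missing
    "languages" field yields None.
    """
    records = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        records.append((rec.get("url"), rec.get("languages")))
    return records


def query_index(url_pattern):
    """Run the query over the network and return (url, languages) pairs."""
    with urlopen(build_query_url(url_pattern)) as resp:
        return parse_index_lines(resp.read().decode("utf-8"))
```

Calling `query_index("www.example.com/*")` would give me the URL and language of each capture, but nothing in these records looks like a title.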
Is there any way to get the titles through the CC API, without downloading every WARC file?
Thanks!
Upvotes: 1
Views: 202
Reputation: 2239
No. The page title isn't indexed in Common Crawl's URL index (neither in the CDX index nor the columnar index).
Upvotes: 3