Reputation: 355
I have thousands of web pages from different websites. Is there a fast way to get the content of all those pages using Common Crawl and Python?
Below is the code I am trying, but the process is slow.
import io
import json
from urllib.parse import quote_plus

import aiohttp
from warcio.archiveiterator import ArchiveIterator

# SERVER and INDEX_NAME are defined elsewhere, e.g.
# SERVER = 'https://index.commoncrawl.org/' and INDEX_NAME = 'CC-MAIN-2024-33'

async def search_cc_index(url):
    # Query the CDX index API for all captures of this URL
    encoded_url = quote_plus(url)
    index_url = f'{SERVER}{INDEX_NAME}-index?url={encoded_url}&output=json'
    async with aiohttp.ClientSession() as session:
        async with session.get(index_url) as response:
            if response.status == 200:
                # The API returns one JSON record per line
                records = (await response.text()).strip().split('\n')
                return [json.loads(record) for record in records]
            else:
                return None

async def fetch_page_from_cc(records):
    # Range-read the WARC record for each capture until one succeeds
    async with aiohttp.ClientSession() as session:
        for record in records:
            offset, length = int(record['offset']), int(record['length'])
            s3_url = f'https://data.commoncrawl.org/{record["filename"]}'
            byte_range = f'bytes={offset}-{offset + length - 1}'
            async with session.get(s3_url, headers={'Range': byte_range}) as response:
                if response.status == 206:
                    # Buffer the byte range, then parse it with warcio's
                    # (synchronous) reader
                    stream = ArchiveIterator(io.BytesIO(await response.read()))
                    for warc_record in stream:
                        if warc_record.rec_type == 'response':
                            return warc_record.content_stream().read()
                else:
                    # Give up on the first failed range request
                    return None
        return None

async def fetch_individual_url(target_url):
    records = await search_cc_index(target_url)
    if records:
        print(f"Found {len(records)} records for {target_url}")
        content = await fetch_page_from_cc(records)
        if content:
            print(f"Successfully fetched content for {target_url}")
    else:
        print(f"No records found for {target_url}")
Upvotes: 1
Views: 44
Reputation: 409
That is a correct way to do it -- and yes, it's kind of slow.
To speed up the index lookup, you can use our columnar index instead of the cdx index.
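For example, here is a rough sketch of a columnar-index lookup with pyarrow and anonymous S3 access, assuming the usual cc-index Parquet table layout under s3://commoncrawl/cc-index/table/cc-main/warc/ (the crawl name, the host filter, and the column list are just example values; pyarrow and s3fs need to be installed):
import pyarrow.dataset as ds
import s3fs

# The columnar index is a set of Parquet files, partitioned by crawl and subset
fs = s3fs.S3FileSystem(anon=True)  # the commoncrawl bucket is publicly readable
index = ds.dataset(
    'commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-33/subset=warc/',
    filesystem=fs,
    format='parquet',
)

# Read only the columns needed for the later range requests; the host filter
# is pushed down to the Parquet row groups instead of scanning every row
records = index.to_table(
    columns=['url', 'warc_filename', 'warc_record_offset', 'warc_record_length'],
    filter=ds.field('url_host_name') == 'example.com',
).to_pylist()
For lookups across many hosts or whole crawls, running the same query with Athena or another SQL engine over those Parquet files tends to be more practical.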
You can also do multiple fetches in parallel, but please be sure to slow down if we send you a 503 reply. You can keep an eye on our performance graphs at https://status.commoncrawl.org/ to see if you're causing a problem, or if someone else is causing a problem.
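For the fetch side, something like this keeps a bounded number of range requests in flight and backs off when a 503 comes back. It is only a sketch: fetch_range, fetch_all, and MAX_CONCURRENCY are illustrative names, the backoff numbers are arbitrary, and the record fields match the CDX records from your question.
import asyncio
import aiohttp

MAX_CONCURRENCY = 10  # arbitrary example limit

async def fetch_range(session, sem, url, offset, length, retries=5):
    # Fetch one WARC record by byte range, retrying with backoff on 503
    byte_range = f'bytes={offset}-{offset + length - 1}'
    for attempt in range(retries):
        async with sem:
            async with session.get(url, headers={'Range': byte_range}) as response:
                if response.status == 206:
                    return await response.read()
                status = response.status
        if status == 503:
            # Asked to slow down: wait, then retry
            await asyncio.sleep(2 ** attempt)
            continue
        return None
    return None

async def fetch_all(records):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [
            fetch_range(
                session, sem,
                f'https://data.commoncrawl.org/{r["filename"]}',
                int(r['offset']), int(r['length']),
            )
            for r in records
        ]
        return await asyncio.gather(*tasks)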
Upvotes: 0