ALTAF HUSSAIN

Reputation: 355

Fetch Page Content from Common Crawl

I have thousands of web pages from different websites. Is there a fast way to get the content of all those pages using Common Crawl and Python?

Below is the code I am trying, but the process is slow.

import asyncio
import io
import json
from urllib.parse import quote_plus

import aiohttp
from warcio.archiveiterator import ArchiveIterator

SERVER = 'https://index.commoncrawl.org/'  # CDX index server
INDEX_NAME = 'CC-MAIN-2024-33'             # example crawl id; use the one you need


async def search_cc_index(url):
    # Query the CDX index for all captures of this URL.
    encoded_url = quote_plus(url)
    index_url = f'{SERVER}{INDEX_NAME}-index?url={encoded_url}&output=json'
    async with aiohttp.ClientSession() as session:
        async with session.get(index_url) as response:
            if response.status == 200:
                # The CDX server returns one JSON record per line.
                records = (await response.text()).strip().split('\n')
                return [json.loads(record) for record in records]
            return None


async def fetch_page_from_cc(records):
    # Return the WARC payload of the first record that can be retrieved.
    async with aiohttp.ClientSession() as session:
        for record in records:
            offset, length = int(record['offset']), int(record['length'])
            s3_url = f'https://data.commoncrawl.org/{record["filename"]}'

            # Request only the bytes covering this WARC record.
            byte_range = f'bytes={offset}-{offset + length - 1}'

            async with session.get(s3_url, headers={'Range': byte_range}) as response:
                if response.status != 206:
                    continue  # try the next record instead of giving up
                # warcio expects a synchronous file-like object, so buffer the
                # (small) ranged response before handing it to ArchiveIterator.
                data = io.BytesIO(await response.read())
                for warc_record in ArchiveIterator(data):
                    if warc_record.rec_type == 'response':
                        # content_stream().read() is synchronous in warcio.
                        return warc_record.content_stream().read()

    return None


async def fetch_individual_url(target_url):
    records = await search_cc_index(target_url)
    if records:
        print(f"Found {len(records)} records for {target_url}")
        content = await fetch_page_from_cc(records)
        if content:
            print(f"Successfully fetched content for {target_url}")
    else:
        print(f"No records found for {target_url}")

Upvotes: 1

Views: 44

Answers (1)

Greg Lindahl

Reputation: 409

That is a correct way to do it -- and yes, it's kind of slow.

To speed up the index lookup, you can use our columnar index instead of the CDX index.
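For instance, here is a sketch of a bulk lookup against the columnar index using pyarrow and anonymous S3 access. The bucket path and column names below follow the published cc-index table layout, so verify them against the current schema before relying on them:

import pyarrow.dataset as ds
import s3fs

# Anonymous read access to the public commoncrawl bucket.
fs = s3fs.S3FileSystem(anon=True)

# Partitioned Parquet table; the path assumes the published cc-index
# table layout and an example crawl id.
dataset = ds.dataset(
    'commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-33/subset=warc/',
    filesystem=fs,
    format='parquet',
)

# One columnar scan can answer many lookups at once, unlike per-URL CDX queries.
table = dataset.to_table(
    columns=['url', 'warc_filename', 'warc_record_offset', 'warc_record_length'],
    filter=ds.field('url_host_name') == 'example.com',
)
print(table.to_pylist()[:3])

A scan like this reads far more data than a single CDX query, so it only pays off when you are looking up many URLs per crawl; Athena, Spark, or duckdb over the same Parquet files are common choices at scale.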

You can also do multiple fetches in parallel, but please be sure to slow down if we send you a 503 reply. You can keep an eye on our performance graphs at https://status.commoncrawl.org/ to see if you're causing a problem, or if someone else is causing a problem.
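Bounded parallelism with 503 backoff might look like the following sketch, which reuses the session, URL, and byte range from the question's fetch loop; the concurrency cap of 10 and the backoff schedule are arbitrary choices, not recommendations:

import asyncio
import random

sem = asyncio.Semaphore(10)  # arbitrary cap on concurrent ranged requests

async def fetch_with_backoff(session, s3_url, byte_range, max_tries=5):
    # Retry on 503 with jittered exponential backoff; give up on other errors.
    for attempt in range(max_tries):
        async with sem:
            async with session.get(s3_url, headers={'Range': byte_range}) as resp:
                if resp.status == 206:
                    return await resp.read()
                if resp.status != 503:
                    return None
        # 503 means slow down: wait before retrying, outside the semaphore
        # so other tasks can still make progress.
        await asyncio.sleep(2 ** attempt + random.random())
    return None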

Upvotes: 0
