Milo Knell

Reputation: 164

Where to put BeautifulSoup code in Asyncio Web Scraping Application

I need to scrape and get the raw text of the body paragraphs for many (5-10k per day) news articles. I've written some threading code, but given the highly I/O-bound nature of this project I am dabbling in asyncio. The code snippet below is no faster than a single-threaded version, and far slower than my threaded version. Could anyone tell me what I am doing wrong? Thank you!

import aiohttp
from bs4 import BeautifulSoup
from unicodedata import normalize

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def scrape_urls(urls):
    results = []
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            html = await fetch(session, url)
            soup = BeautifulSoup(html, 'html.parser')
            body = soup.find('div', attrs={'class': 'entry-content'})
            paras = [normalize('NFKD', para.get_text()) for para in body.find_all('p')]
            results.append(paras)
    return results

Upvotes: 5

Views: 4781

Answers (1)

user4815162342

Reputation: 154926

await means "wait until the result is ready", so when you await the fetching in each loop iteration, you request (and get) sequential execution. To parallelize fetching, you need to spawn each fetch as a background task using something like asyncio.create_task(fetch(...)) and then await the tasks, similar to how you'd do it with threads. Or, even more simply, you can let the asyncio.gather convenience function do it for you. For example (untested):

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', attrs={'class': 'entry-content'})
    return [normalize('NFKD', para.get_text())
            for para in body.find_all('p')]

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    paras = parse(html)
    return paras

async def scrape_urls(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_and_parse(session, url) for url in urls)
        )
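
If you'd rather spawn the tasks explicitly instead of using asyncio.gather, a rough equivalent of the above (untested sketch, reusing the same fetch_and_parse) would be:

async def scrape_urls(urls):
    async with aiohttp.ClientSession() as session:
        # spawn all fetches as background tasks up front...
        tasks = [asyncio.create_task(fetch_and_parse(session, url))
                 for url in urls]
        # ...then await each one, collecting results in the order of urls
        return [await task for task in tasks]

asyncio.gather just packages this pattern up for you.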

If you find that this still runs slower than the multi-threaded version, it is possible that the parsing of HTML is slowing down the IO-related work. (Asyncio runs everything in a single thread by default.) To prevent CPU-bound code from interfering with asyncio, you can move the parsing to a separate thread using run_in_executor:

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in a separate thread, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(None, parse, html)
    return paras

Note that run_in_executor must be awaited because it returns an awaitable that is "woken up" when the background thread completes the work it was given. As this version uses asyncio for the IO and threads for parsing, it should run about as fast as your threaded version, but scale to a much larger number of parallel downloads.
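
In either case you'd kick the whole thing off from regular synchronous code with asyncio.run (Python 3.7+); for example, assuming urls is the list of article URLs you want to scrape:

import asyncio

urls = [...]  # fill in with your list of article URLs

# returns one list of paragraphs per URL, in the same order as urls
all_paras = asyncio.run(scrape_urls(urls))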

Finally, if you want the parsing to actually run in parallel across multiple cores, you can use multiprocessing instead:

import concurrent.futures

_pool = concurrent.futures.ProcessPoolExecutor()

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in a separate process, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(_pool, parse, html)
    return paras
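
One caveat not shown above: on platforms that start worker processes by spawning (Windows, and macOS by default), the main module is re-imported in each worker, so the code that actually launches the scrape should sit behind the usual __main__ guard. A minimal sketch, assuming everything above lives in a single script:

if __name__ == "__main__":
    urls = [...]  # fill in with your list of article URLs
    all_paras = asyncio.run(scrape_urls(urls))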

Upvotes: 13
