Cristian Tozuna

Reputation: 33

Are threads faster than asyncio?

I'm working on a project that parses data from a lot of websites. Most of my code is done, so I'm looking forward to using asyncio to eliminate the I/O waiting, but I still wanted to test how threading would compare, for better or worse. To do that, I wrote some simple code to make requests to 100 websites. By the way, I'm using the requests_html library for this; fortunately, it supports asynchronous requests as well.

The asyncio code looks like this:

import requests
import time

from requests_html import AsyncHTMLSession

aio_session = AsyncHTMLSession()
urls = [...] # 100 urls


async def fetch(url):
    try:
        response = await aio_session.get(url, timeout=5)
    except requests.exceptions.ConnectionError:
        # No response in this case, so return early with an error status
        return {'url': url, 'status': 404}
    except requests.exceptions.ReadTimeout:
        return {'url': url, 'status': 408}

    return {
        'url': url,
        'status': response.status_code,
        'html': response.html
    }


    
def extract_html(urls):
    tasks = []

    for url in urls:
        # The default argument binds each url at definition time
        tasks.append(lambda url=url: fetch(url))

    websites = aio_session.run(*tasks)

    return websites


if __name__ == "__main__":
    start_time = time.time()
    websites = extract_html(urls)
    print(time.time() - start_time)

Execution time (multiple tests):

13.466366291046143
14.279950618743896
12.980706453323364

BUT if I run an equivalent example with threading:

from queue import Queue
import requests
from requests_html import HTMLSession
from threading import Thread
import time

num_fetch_threads = 50
enclosure_queue = Queue()

html_session = HTMLSession()
urls = [...] # 100 urls


def fetch(i, q):
    while True:
        url = q.get()
        try:
            response = html_session.get(url, timeout=5)
            status = 200
        except requests.exceptions.ConnectionError:
            status = 404
        except requests.exceptions.ReadTimeout:
            status = 408

        q.task_done()


if __name__ == "__main__":
    for i in range(num_fetch_threads):
        worker = Thread(target=fetch, args=(i, enclosure_queue,))
        worker.daemon = True  # setDaemon() is deprecated
        worker.start()

    start_time = time.time()
    for url in urls:
        enclosure_queue.put(url)

    enclosure_queue.join()

    print(time.time() - start_time)

Execution time (multiple tests):

7.476433515548706
6.786043643951416
6.717151403427124

The thing I don't understand: both libraries are meant to address I/O-bound problems, so why are the threads faster? The more I increase the number of threads, the more resources it uses, but it's a lot faster. Can someone please explain why threads are faster than asyncio in my example?

Thanks in advance.

Upvotes: 2

Views: 1280

Answers (1)

Vincent

Reputation: 13415

It turns out requests-html uses a pool of threads for running the requests. The default number of threads is the number of cores on the machine multiplied by 5. This probably explains the difference in performance you noticed: your asyncio version is not doing single-threaded, non-blocking I/O at all; it is dispatching blocking calls to a thread pool, and a smaller one than the 50 threads your threading version uses.
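Here is a minimal stdlib sketch of the mechanism described above (not requests-html's actual code): awaiting a blocking call through `run_in_executor` still ties up a pool thread for the call's full duration, so concurrency is capped by `max_workers` no matter how many coroutines you launch. The `blocking_io` function is a stand-in for a blocking `requests.get` call.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_io(n):
    # Stand-in for a blocking requests.get() call
    time.sleep(0.2)
    return n


async def main(max_workers, jobs=8):
    loop = asyncio.get_running_loop()
    # Each coroutine awaits, but a pool thread stays busy until
    # the blocking call returns -- much like requests-html's wrapper.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        tasks = [loop.run_in_executor(pool, blocking_io, n) for n in range(jobs)]
        return await asyncio.gather(*tasks)


start = time.time()
results = asyncio.run(main(max_workers=2))  # 8 jobs over 2 threads: ~4 batches
narrow = time.time() - start

start = time.time()
asyncio.run(main(max_workers=8))            # 8 jobs over 8 threads: ~1 batch
wide = time.time() - start

print(f"2 workers: {narrow:.2f}s, 8 workers: {wide:.2f}s")
```

With 2 workers the 8 jobs run in roughly 4 sequential batches; with 8 workers they all overlap, which mirrors why bumping the thread count in your threading version speeds things up.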

You might want to try the experiment again using aiohttp instead. In the case of aiohttp, the underlying socket for the HTTP connection is actually registered in the asyncio event loop, so no threads should be involved here.
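A minimal aiohttp version of your benchmark might look like this (a sketch, assuming aiohttp is installed; the error-to-status mapping mirrors your original code):

```python
import asyncio

import aiohttp


async def fetch(session, url):
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as response:
            html = await response.text()
            return {'url': url, 'status': response.status, 'html': html}
    except asyncio.TimeoutError:
        return {'url': url, 'status': 408}
    except aiohttp.ClientError:
        return {'url': url, 'status': 404}


async def extract_html(urls):
    # A single session is reused for all requests (connection pooling)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))


# Usage, mirroring the question's benchmark:
# urls = [...]  # 100 urls
# start_time = time.time()
# websites = asyncio.run(extract_html(urls))
# print(time.time() - start_time)
```

Since the sockets are registered directly with the event loop, all 100 requests can be in flight at once on a single thread, with no executor cap involved.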

Upvotes: 5
