saladzic

Reputation: 56

Python aiohttp (with asyncio) sends requests very slowly

Situation: I am trying to send an HTTP request to every domain listed in a file I already downloaded and capture the destination URL I am forwarded to.

Problem: I have followed a tutorial, but I get far fewer responses than expected. I get around 100 responses per second, while the tutorial reports 100,000 responses per minute. The script also gets slower and slower after a couple of seconds, until I only get 1 response every 5 seconds.

Already tried: At first I thought the problem was that I was running it on a Windows server. But after trying the script on my own computer (macOS), I noticed it was only a little bit faster, not much. On another Linux server it was just as slow as on my computer.

Code: https://pastebin.com/WjLegw7K

import asyncio
import glob
import os
import re

from aiohttp import ClientSession

work_dir = os.path.dirname(__file__)

async def fetch(url, session):
    try:
        async with session.get(url, ssl=False) as response:
            if response.status == 200:
                delay = response.headers.get("DELAY")
                date = response.headers.get("DATE")
                print("{}:{} with delay {}".format(date, response.url, delay))
                return await response.read()
    except Exception:
        pass

async def bound_fetch(sem, url, session):
    # Getter function with semaphore.
    async with sem:
        await fetch(url, session)


async def run():
    os.chdir(work_dir)
    for file in glob.glob("cdx-*"):
        print("Opening: " + file)
        opened_file = file
        tasks = []
        # create instance of Semaphore
        sem = asyncio.Semaphore(40000)
        with open(work_dir + '/' + file) as infile:
            seen = set()
            async with ClientSession() as session:
                for line in infile:
                    regex = re.compile(r'://(.*?)/')
                    domain = regex.search(line).group(1)
                    domain = domain.lower()

                    if domain not in seen:
                        seen.add(domain)

                        task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
                        tasks.append(task)

                    del line
                responses = asyncio.gather(*tasks)
                await responses
            infile.close()
            del seen
            del file


loop = asyncio.get_event_loop()

future = asyncio.ensure_future(run())
loop.run_until_complete(future)

I really don't know how to fix this issue, especially because I'm very new to Python... but I have to get it to work somehow :(

Upvotes: 2

Views: 2481

Answers (1)

user4815162342

Reputation: 155555

It's hard to tell what is going wrong without actually debugging the code, but one potential problem is that file processing is serialized. In other words, the code never processes the next file until all the requests from the current file have finished. If there are many files and one of them is slow, this could be a problem.

To change this, define run along these lines:

async def run():
    os.chdir(work_dir)
    async with ClientSession() as session:
        sem = asyncio.Semaphore(40000)
        seen = set()
        pending_tasks = set()
        for f in glob.glob("cdx-*"):
            print("Opening: " + f)
            with open(f) as infile:
                lines = list(infile)
            for line in lines:
                domain = re.search(r'://(.*?)/', line).group(1)
                domain = domain.lower()
                if domain in seen:
                    continue
                seen.add(domain)
                task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
                pending_tasks.add(task)
                # ensure that each task removes itself from the pending set
                # when done, so that the set doesn't grow without bounds
                task.add_done_callback(pending_tasks.remove)
        # await the remaining tasks
        await asyncio.wait(pending_tasks)

Another important thing: silencing all exceptions in fetch() is bad practice, because it gives no indication that something has started going wrong (due to either a bug or a simple typo). This might well be the reason your script becomes "slow" after a while: fetch is raising exceptions and you're never seeing them. Instead of pass, use something like print(f'failed to get {url}: {e}'), where e is the object you get from except Exception as e.
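As a rough sketch (this is the fetch() from the question, unchanged apart from the error reporting), that could look like this:

async def fetch(url, session):
    try:
        async with session.get(url, ssl=False) as response:
            if response.status == 200:
                delay = response.headers.get("DELAY")
                date = response.headers.get("DATE")
                print("{}:{} with delay {}".format(date, response.url, delay))
                return await response.read()
    except Exception as e:
        # report the failure instead of silently swallowing it
        print(f'failed to get {url}: {e}')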

Several additional remarks:

  • There is almost never a need to del local variables in Python; the garbage collector does that automatically.
  • You needn't call close() on a file opened with a with statement; with is designed specifically to do that closing automatically for you (see the short example after this list).
  • The original code added domains to a seen set, but still processed an already seen domain. This version skips any domain for which it has already spawned a task.
  • You can create a single ClientSession and use it for the entire run.
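
For example, a minimal illustration of the with-statement point (the file name here is just a placeholder):

with open('cdx-00000') as infile:  # placeholder file name
    lines = list(infile)
# no infile.close() needed; the file is closed when the with block exits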

Upvotes: 2
