unsettled_duck

Reputation: 77

Python - Looping through large list and downloading images quickly

So currently I have this code, and it works exactly as I intended.

import urllib.request
from tqdm import tqdm

with open("output.txt", "r") as file:
    itemIDS = [line.strip() for line in file]

for item_id in tqdm(itemIDS):
    urllib.request.urlretrieve(
        "https://imagemocksite.com?id=" + item_id,
        "images/" + item_id + ".jpg")

print("All images downloaded")

I searched around, and the solutions I found weren't really what I was looking for. I have a 200 Mbps connection, so bandwidth isn't my issue.

My issue is that my loop only runs at 1.1-1.57 iterations per second. I want to make this faster, as I have over 5k images to download, and they're only roughly 1-5 KB each.

Also, if anyone has any general code tips, I'd appreciate them! I'm learning Python and it's pretty fun, so I'd like to improve wherever possible!

Edit: Using the info below about asyncio, I am now getting 1.7-2.1 it/s, which is better! Could it be faster? Maybe I used it wrong?

import urllib.request
from tqdm import tqdm
import asyncio

with open("output.txt", "r") as file:
    itemIDS = [line.strip() for line in file]

async def download():
    for item_id in tqdm(itemIDS):
        await asyncio.sleep(1)
        urllib.request.urlretrieve(
            "https://imagemocksite.com?id=" + item_id,
            "images/" + item_id + ".jpg")

asyncio.run(download())
print("All images downloaded")

Upvotes: 2

Views: 1604

Answers (1)

Roméo Després

Reputation: 2173

The comments have already provided good advice, and I think you're right to use asyncio; it really is the typical Python tool for this kind of job.

I just wanted to help out, since the code you've provided doesn't really use asyncio's power.

First you'll have to install aiohttp and aiofiles (e.g. pip install aiohttp aiofiles), which handle HTTP requests and local filesystem I/O asynchronously.

Then, define a download(item_id, session) helper coroutine that downloads a single image based on its item_id. session will be an aiohttp.ClientSession, which is aiohttp's base class for running async HTTP requests.

Finally, the trick is to have a download_all coroutine that calls asyncio.gather on all the individual download() coroutines at once. asyncio.gather is how you tell asyncio to run several coroutines "in parallel".
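To see what gather buys you, here's a tiny self-contained toy (names made up, nothing from your code): three one-second coroutines gathered together finish in about one second total, not three, because their waits overlap.

import asyncio

async def work(i):
    # Each task "waits" one second without blocking the others.
    await asyncio.sleep(1)
    return i

async def main():
    # All three sleeps overlap, so this prints [1, 2, 3] after ~1 second.
    print(await asyncio.gather(work(1), work(2), work(3)))

asyncio.run(main())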

This should massively speed up your downloads. If it doesn't, then it's the third-party server that is limiting you.

import asyncio

import aiohttp
import aiofiles


with open("output.txt", "r") as file:
    itemIDS = [line.strip() for line in file]


async def download(item_id, session):
    # Fetch a single image and write it to disk; neither the HTTP request
    # nor the file write blocks the event loop.
    url = "https://imagemocksite.com"
    filename = f"images/{item_id}.jpg"
    async with session.get(url, params={"id": item_id}) as response:
        async with aiofiles.open(filename, "wb") as f:
            await f.write(await response.read())


async def download_all():
    # Open one shared session and run every download concurrently.
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *[download(item_id, session) for item_id in itemIDS]
        )


asyncio.run(download_all())
print("All images downloaded")

Upvotes: 2
