Reputation: 77
So currently I have this code, and it works exactly as I intended it to.
import urllib.request
from tqdm import tqdm

with open("output.txt", "r") as file:
    itemIDS = [line.strip() for line in file]

x = 0
for length in tqdm(itemIDS):
    urllib.request.urlretrieve(
        "https://imagemocksite.com?id=" + str(itemIDS[x]),
        "images/" + str(itemIDS[x]) + ".jpg")
    x += 1

print("All images downloaded")
I was searching around and the solutions I found weren't really what I was looking for. I have a 200 Mbps connection, so bandwidth isn't my issue.
My issue is that my loop only iterates 1.1-1.57 times per second. I want to make this faster, as I have over 5k images to download, and they're only roughly 1-5 KB each.
Also, if anyone has any code tips in general, I'd appreciate it! I'm learning Python and it's pretty fun, so I would like to get better wherever possible!
Edit: Using the info below about asyncio I am now getting 1.7-2.1 it/s, which is better! Could it be faster? Maybe I used it wrong?
import urllib.request
from tqdm import tqdm
import asyncio

with open("output.txt", "r") as file:
    itemIDS = [line.strip() for line in file]

async def download():
    x = 0
    for length in tqdm(itemIDS):
        await asyncio.sleep(1)
        urllib.request.urlretrieve(
            "https://imagemocksite.com?id=" + str(itemIDS[x]),
            "images/" + str(itemIDS[x]) + ".jpg")
        x += 1

asyncio.run(download())
print("All images downloaded")
Upvotes: 2
Views: 1604
Reputation: 2173
Comments have already provided good advice, and I think you're right to use asyncio, which is really the typical Python tool for that kind of job.
I just wanted to bring some help, since the code you've provided doesn't really use its power.
First you'll have to install aiohttp and aiofiles, which handle HTTP requests and local filesystem I/O asynchronously.
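Both packages are on PyPI and can be installed with pip:

pip install aiohttp aiofiles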
Then, define a download(item_id, session) helper coroutine that downloads one single image based on its item_id. session will be an aiohttp.ClientSession, which is the base class for running async HTTP requests in aiohttp.
The trick is finally to have a download_all coroutine that calls asyncio.gather on all the individual download() coroutines at once. asyncio.gather is the way to tell asyncio to run several coroutines "in parallel".
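As a quick standalone illustration of that behaviour (a toy example, separate from the solution below): gather makes the waits overlap, so ten one-second waits finish in about one second total rather than ten.

import asyncio
import time

async def work(n):
    await asyncio.sleep(1)  # stands in for an I/O wait, e.g. an HTTP request
    return n

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*[work(i) for i in range(10)])
    # All ten 1-second waits overlap, so this prints ~1s, not ~10s
    print(results, f"{time.perf_counter() - start:.1f}s")

asyncio.run(main())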
This should massively speed up your downloads. If it doesn't, then it's the third-party server that is limiting you.
import asyncio
import aiohttp
import aiofiles

with open("output.txt", "r") as file:
    itemIDS = [line.strip() for line in file]

async def download(item_id, session):
    url = "https://imagemocksite.com"
    filename = f"images/{item_id}.jpg"
    # "params" builds the ?id=... query string; the query parameters are
    # not a positional argument of session.get()
    async with session.get(url, params={"id": item_id}) as response:
        async with aiofiles.open(filename, "wb") as f:
            await f.write(await response.read())

async def download_all():
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *[download(item_id, session) for item_id in itemIDS]
        )

asyncio.run(download_all())
print("All images downloaded")
Upvotes: 2