Reputation: 58
The title is pretty self-explanatory.
So far I have tried using threads, multiprocessing, etc.
Below is my current code, but I would like to know the absolute fastest way possible.
import requests
import threading

def download(url):
    r = requests.get(url)
    with open(url, "wb") as f:
        f.write(r.content)

urls = [
    "https://google.com/favicon.ico",
    ...
]

for url in urls:
    threading.Thread(target=download, args=[url]).start()
Upvotes: 1
Views: 4675
Reputation: 6055
You can use asyncio to start each download in a separate executor, which will boost the speed. It is also faster than spawning a thread per URL and avoids the pooling overhead of multiprocessing.
Here is a slightly modified version of your code (the URLs contain characters that you can't use in filenames, so the files are named by index instead):
%%timeit
import requests
import threading

def download(url, fn):
    r = requests.get(url)
    with open(str(fn), "wb") as f:
        f.write(r.content)

# 15 copies of the same test URL
urls = ['https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'] * 15

for i, url in enumerate(urls):
    threading.Thread(target=download, args=[url, i]).start()
took: 5.91 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
As opposed to the asyncio-based version, which takes only 27.7 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each). Note that run_in_executor returns immediately, so this timing measures how quickly the downloads are scheduled rather than how long they take to finish:
%%timeit
import requests
import asyncio

def background(f):
    def wrapped(*args, **kwargs):
        return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
    return wrapped

@background
def download(url, fn):
    r = requests.get(url)
    with open(str(fn), "wb") as f:
        f.write(r.content)

# 15 copies of the same test URL
urls = ['https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'] * 15

for i, url in enumerate(urls):
    download(url, i)
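One caveat: because run_in_executor only schedules the blocking call and returns a future, the loop above finishes before the files are actually written. If you need to block until every download has completed, you can gather the executor futures. A minimal sketch of that pattern, using a hypothetical blocking_download stand-in instead of real network calls so it runs anywhere:

```python
import asyncio

# Hypothetical stand-in for the blocking download(url, fn) above,
# so the pattern can be demonstrated without network access.
def blocking_download(url, fn):
    return fn  # pretend we saved url to file fn and report which one

async def main(urls):
    loop = asyncio.get_running_loop()
    # Submit each blocking call to the default thread-pool executor...
    futures = [
        loop.run_in_executor(None, blocking_download, url, i)
        for i, url in enumerate(urls)
    ]
    # ...and wait here until every download has actually finished.
    return await asyncio.gather(*futures)

results = asyncio.run(main(["https://example.com/a", "https://example.com/b"]))
print(results)  # results arrive in submission order
```

asyncio.gather preserves submission order, so the results line up with the original URL list even though the downloads run concurrently.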
Upvotes: 4
Reputation: 2407
Your current code looks pretty good, except that it immediately spawns a thread for every URL, which might actually slow you down if you have a large number of URLs, because you end up with too many threads.
Check this out: How many threads is too many?
I suggest using multiprocessing.pool.ThreadPool
or multiprocessing.pool.Pool,
which let you set the maximum number of active threads/processes. I would probably go with ThreadPool:
even though threads in Python are limited to a single CPU core, they don't have the overhead of creating a new process, and your disk or network is likely to be the bottleneck anyway.
from multiprocessing.pool import ThreadPool
import requests

MAX_THREADS = 100  # <--- tweak this

urls = [
    "https://google.com/favicon.ico",
    ...
]

def download(url):
    r = requests.get(url)
    with open(url, "wb") as f:  # note: URLs containing "/" are not valid filenames
        f.write(r.content)

if __name__ == "__main__":
    with ThreadPool(MAX_THREADS) as p:
        p.map(download, urls)
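The standard library's concurrent.futures.ThreadPoolExecutor does the same job with an equivalent map interface, if you prefer it over multiprocessing.pool. A minimal sketch, with a hypothetical fetch_length stub standing in for download so it runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stub in place of download(url); it just measures the URL
# so the example needs no network access.
def fetch_length(url):
    return len(url)

urls = ["https://google.com/favicon.ico", "https://example.com/"]

MAX_THREADS = 100  # upper bound on concurrent worker threads

# executor.map runs fetch_length across urls on up to MAX_THREADS threads
# and yields results in input order.
with ThreadPoolExecutor(max_workers=MAX_THREADS) as ex:
    results = list(ex.map(fetch_length, urls))
print(results)
```

As with ThreadPool.map, the executor blocks at the end of the with-block until all work is done, so there is no need to join threads manually.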
Upvotes: 4