iiPython
iiPython

Reputation: 58

Fastest way to download a set of files in Python

The title is pretty self-explanatory.

So far I have tried using threads, multiprocessing, etc.

Below is my current code but I would like to know the absolute, fastest way possible.

import requests
import threading

def download(url):

  r = requests.get(url)

  with open(url, "wb") as f:
    f.write(r.content)

urls = [
  "https://google.com/favicon.ico",
  ...
]

for url in urls:

  threading.Thread(target = download, args = [url]).start()

Upvotes: 1

Views: 4675

Answers (2)

Hamza
Hamza

Reputation: 6055

A much faster solution! (200x faster to start downloads)

You can use asyncio. To start each download in a separate executor which will boost the speed. Plus it would be a lot faster than starting threads and will avoid pooling overhead of multiprocessing as well:

Slightly modified version of your code (cause urls have characters which you cnat use in filenames!):

%%timeit
import requests
import threading

def download(url, fn):
  r = requests.get(url)
  with open(str(fn), "wb") as f:
    f.write(r.content)

urls = [
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
]

for i,url in enumerate(urls):
    threading.Thread(target = download, args = [url, i]).start()

took: 5.91 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 100 loops each) As opposed to asyncio based version which only takes: 27.7 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each):

%%timeit
import requests
import asyncio

def background(f):
    def wrapped(*args, **kwargs):
        return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
    return wrapped

@background
def download(url, fn):
  r = requests.get(url)
  with open(str(fn), "wb") as f:
    f.write(r.content)

urls = [
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
  'https://www.google.com/url?sa=i&url=https%3A%2F%2Fcityschool.org%2Fcampus%2Ffairmount%2Fhello%2F&psig=AOvVaw2gMH6tzY8psCcMab5FfG2u&ust=1605400822303000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCJCrkKDmgO0CFQAAAAAdAAAAABAD'
]

for i,url in enumerate(urls):
    download(url,i)

Upvotes: 4

Czaporka
Czaporka

Reputation: 2407

Your current code looks pretty good except it is going to immediately spawn a thread for every URL, which might actually slow you down if you have a big number of URLs, because you're going to end up with too many threads.
Check this out: How many threads is too many?

I suggest using multiprocessing.pool.ThreadPool or multiprocessing.pool.Pool which allow you to set the maximum number of active threads/processes. I would probably go with ThreadPool even though threads in Python are limited to a single CPU core, but they don't have the overhead of creating a new process, and your HDD is likely to be a bottleneck anyway.

from multiprocessing.pool import ThreadPool

import requests

MAX_THREADS = 100  # <--- tweak this

urls = [
  "https://google.com/favicon.ico",
  ...
]

def download(url):
    r = requests.get(url)
    with open(url, "wb") as f:
        f.write(r.content)

if __name__ == "__main__":
    with ThreadPool(MAX_THREADS) as p:
        p.map(download, urls)

Upvotes: 4

Related Questions