mythos917

Reputation: 11

Parallel downloading with urlretrieve

I regularly have to download and rename HTML pages in bulk and wrote this simple code for it a while ago:

import socket
import urllib.request

socket.setdefaulttimeout(5)

# each line of my_file is expected to be "<url>;<new file name>"
with open(my_file, "r") as file_read:
    for line in file_read:
        try:
            sl = line.strip().split(";")
            url = sl[0]
            newname = sl[1] + ".html"
            urllib.request.urlretrieve(url, newname)
        except Exception:
            pass  # skip malformed lines and failed downloads

This works well enough for a few hundred websites, but takes waaaaay too long for a larger number of downloads (20-50k). What would be the simplest and best way to speed it up?

Upvotes: 1

Views: 529

Answers (1)

user3666197

Reputation: 1

Q :
"I regularly have to ... What would be the simplest and best way to speed it up?"

A :
The SIMPLEST (which the approach suggested in the comments is not) & the BEST way is to at least:
(a)
minimise all overheads (50k repetitions of the thread-instantiation cost being one such class of costs),
(b)
harness the embarrassing independence of the downloads (which is still not a True-[PARALLEL] process-flow),
(c)
go as close as possible to the bleeding edge of a just-[CONCURRENT], latency-masked process-flow.

Given
that both simplicity & performance are the measure of "best"-ness:

Any cost that does not first justify itself by a sufficiently large increase in performance, and that, second, does not create an additional positive net-effect on performance (speed-up), is a performance ANTI-pattern & an unforgivable Computer Science sin.

Therefore
I could not promote using GIL-lock-bound Python threads: by design the GIL prevents anything beyond a just-[CONCURRENT] processing, in which any number of Python threads take turns in a one-after-another-after-another-... re-[SERIAL]-ised round-robin, one and only one such thread being let to run for a short GIL-holding quantum of code-interpretation time while all the others block and wait (being rather a performance ANTI-pattern, isn't it?),
so
rather go for process-based concurrency of the work-flow. Performance gains a lot here for ~50k URL-fetches, where each fetch carries large hundreds or thousands of [ms] of latency (protocol-and-security handshaking setup + remote url-decode + remote content-assembly + remote content-into-protocol encapsulation + remote-to-local network-flows + local protocol-decode + ...).

Sketched process-flow framework :

import os
from joblib import Parallel, delayed

MAX_WORKERs = max( 1, ( os.cpu_count() or 2 ) - 1 )   # leave one core free for the O/S & __main__

def main( file_with_URLs ):
    """                                                     __doc__
    .INIT worker-processes, each with a split-scope of tasks
    """
    IDs = range( MAX_WORKERs )
    RES_if_need = Parallel( n_jobs = MAX_WORKERs
                            )(       delayed( block_processor_FUN #-- fun CALLABLE
                                              )( file_with_URLs,  #---------- fun PAR1
                                                 wPROC            #---------- fun PAR2
                                                 )
                                              for wPROC in IDs
                                     )

def block_processor_FUN( file_with_URLs = None,
                         file_from_PART = 0
                         ):
    """                                                     __doc__
    .OPEN file_with_URLs
    .READ file_from_PART, row-wise - till next part starts
                                   - ref. global MAX_WORKERs
    """
    ...

This is the initial, __main__-side Python-interpreter trick: spawn just enough worker-processes, each of which starts crawling the my_file-"list" of URLs independently of all the others, so that an indeed just-[CONCURRENT] flow of work begins.

The block_processor_FUN(), passed by reference to the workers, simply opens the file and fetches/processes only its "own" fraction of it, from ( wPROC / MAX_WORKERs ) to ( ( wPROC + 1 ) / MAX_WORKERs ) of its number of lines.

That simple.
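
A minimal sketch of one possible block_processor_FUN() body, assuming the same ";"-separated "<url>;<name>" row format as in the question and the module-level MAX_WORKERs from the sketch above (the slicing-by-row-count is just one illustrative way to split the work):

import os
import socket
import urllib.request

MAX_WORKERs = max( 1, ( os.cpu_count() or 2 ) - 1 )    # same module-level global as in the sketch

socket.setdefaulttimeout( 5 )                          # keep the original 5 [s] timeout

def block_processor_FUN( file_with_URLs = None,
                         file_from_PART = 0
                         ):
    # read the whole file once, keep only non-empty rows
    with open( file_with_URLs, "r" ) as f:
        rows = [ line.strip() for line in f if line.strip() ]

    # this worker's "own" fraction: [ wPROC / MAX_WORKERs ; ( wPROC + 1 ) / MAX_WORKERs )
    part_from = (   file_from_PART       * len( rows ) ) // MAX_WORKERs
    part_till = ( ( file_from_PART + 1 ) * len( rows ) ) // MAX_WORKERs

    for row in rows[part_from:part_till]:
        try:
            url, name = row.split( ";" )[:2]
            urllib.request.urlretrieve( url, name + ".html" )
        except Exception:
            pass                                       # skip malformed rows / failed fetches

With process-based backends it is also safer to call main() only from a protected entry point, so that freshly spawned workers can re-import the module without re-executing it (the file name "my_file.txt" below is an illustrative placeholder):

if __name__ == "__main__":
    main( "my_file.txt" )        # one "<url>;<name>" pair per row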

If willing to tune-up the corner cases, where some URL may take (and does take) much longer than the others, one may improve this into a form of load-balancing fair-queueing, yet at the cost of a more complex design (many process-to-process messaging queues are available for this): a { __main__ | main() }-side FQ/LB-feeder hands out the work, and the worker-processes retrieve their next task from such a job-request FQ/LB-facility.

More complex, but also more robust to an uneven distribution of URL-serving durations "across" the my_file-ordered list of URLs to serve.
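
One possible shape of such a { __main__ | main() }-side feeder with worker-processes pulling from a shared queue, sketched here with the standard multiprocessing module (the names queue_worker_FUN / main_FQ_LB and the file name "my_file.txt" are illustrative assumptions, not part of the original):

import multiprocessing as mp
import socket
import urllib.request

def queue_worker_FUN( task_Q ):
    # each worker pulls its next ( url, name ) task as soon as it becomes free
    socket.setdefaulttimeout( 5 )
    while True:
        task = task_Q.get()
        if task is None:                       # poison-pill -> no more work
            break
        url, name = task
        try:
            urllib.request.urlretrieve( url, name + ".html" )
        except Exception:
            pass                               # skip failed fetches

def main_FQ_LB( file_with_URLs, n_workers = 4 ):
    task_Q  = mp.Queue( maxsize = 2 * n_workers )
    workers = [ mp.Process( target = queue_worker_FUN, args = ( task_Q, ) )
                for _ in range( n_workers ) ]
    for w in workers: w.start()

    with open( file_with_URLs, "r" ) as f:     # __main__-side FQ/LB-feeder
        for line in f:
            if ";" in line:
                url, name = line.strip().split( ";" )[:2]
                task_Q.put( ( url, name ) )

    for _ in workers: task_Q.put( None )       # one poison-pill per worker
    for w in workers: w.join()

if __name__ == "__main__":
    main_FQ_LB( "my_file.txt", n_workers = 8 )

Because a slow URL only delays the single worker that happens to be fetching it, while the others keep draining the queue, this layout stays efficient even when serving times vary wildly across the list.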

The choices of levels of simplicity / complexity compromises, that impact the resulting performance / robustness are yours.

For more details you may like to read this and the code from this, and the there-directed examples or tips for further performance-boosting.

Upvotes: 1
