Reputation: 11
I regularly have to download and rename HTML pages in bulk and wrote this simple code for it a while ago:
import socket
import urllib.request

socket.setdefaulttimeout(5)

with open(my_file, "r") as file_read:
    for line in file_read:
        try:
            sl = line.strip().split(";")
            url = sl[0]
            newname = sl[1] + ".html"
            urllib.request.urlretrieve(url, newname)
        except Exception:
            pass  # skip malformed lines and failed downloads
This works well enough for a few hundred websites, but takes waaaaay too long for a larger number of downloads (20-50k). What would be the simplest and best way to speed it up?
Upvotes: 1
Views: 529
Reputation: 1
Q :
" I regularly have to ...
What would be the simplest and best way to speed it up ? "
A :
The SIMPLEST ( which the commented approach is not ) & the BEST way is to at least :
(a) minimise all overheads ( 50k times the Thread-instantiation costs being one such class of costs ),
(b) harness the embarrassing independence of the individual fetches ( yet not being a true-[PARALLEL] problem ) in the process-flow,
(c) go as close as possible to the bleeding edge of a just-[CONCURRENT], latency-masked process-flow.
Given
both the simplicity & the performance seem to be the measure of the "best"-ness :
Any costs, that do not first justify the costs of introducing themselves by a sufficiently increased performance, and second, that do not create an additional positive net-effect on the performance ( speed-up ), are performance ANTI-patterns & unforgivable Computer Science sins.
Therefore
I could not promote using the GIL-lock bound ( by-design preventing even a just-[CONCURRENT] processing ) & performance-suffocated, step-by-step round-robin stepping of any amount of Python-threads, in a one-after-another-after-another-...-re-[SERIAL]-ised chain of ~ 100 [ms]-quanta of code-interpretation time-blocks, during which one and only one such Python-thread is let to run, while all the others are blocked-waiting ( being rather a performance ANTI-pattern, isn't it ? ).
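The actual length of that interpreter hand-over quantum is configurable and version-dependent; a minimal check, assuming a CPython 3.2+ interpreter, may read :
import sys

print( sys.getswitchinterval() )  # the current GIL hand-over quantum [s]
sys.setswitchinterval( 0.100 )    # may be re-set, e.g. to 100 [ms]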
So
rather go in for a process-based concurrency of the work-flow. Performance gains a lot here for ~ 50k url-fetches, where large hundreds to thousands of [ms] of latency per fetch ( protocol-and-security handshaking setup + remote url-decode + remote content-assembly + remote content-into-protocol encapsulation + remote-to-local network-flows + local protocol-decode + ... ) can be masked by other, concurrently running fetches.
Sketched process-flow framework :
import os
from joblib import Parallel, delayed

MAX_WORKERs = max( 1, ( os.cpu_count() or 2 ) - 1 )

def main( file_in ):
    """ __doc__
    .INIT worker-processes, each with a split-scope of tasks
    """
    IDs = range( MAX_WORKERs )
    RES_if_need = Parallel( n_jobs = MAX_WORKERs
                            )( delayed( block_processor_FUN #- fun CALLABLE
                                        )( file_in, #--------- fun PAR1
                                           wPROC    #--------- fun PAR2
                                           )
                               for wPROC in IDs
                               )

def block_processor_FUN( file_with_URLs = None,
                         file_from_PART = 0
                         ):
    """ __doc__
    .OPEN file_with_URLs
    .READ file_from_PART, row-wise - till the next part starts
          - ref. global MAX_WORKERs
    """
    ...
This is the initial Python-interpreter __main__-side trick to spawn just-enough worker-processes, which start crawling the my_file-"list" of URL-s independently, AND an indeed just-[CONCURRENT]-flow of work starts, one being independent of any other.
The block_processor_FUN(), passed by reference to the workers, does simply open the file and starts fetching/processing only its "own" fraction of it, being from ( wPROC / MAX_WORKERs ) to ( ( wPROC + 1 ) / MAX_WORKERs ) of its number of lines.
That simple.
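A minimal sketch of such a worker-side body ( the even split of the line-ranges and the urllib-based fetch below are assumptions of this sketch, not a definitive implementation ) may read :
import urllib.request

def block_processor_FUN( file_with_URLs = None,
                         file_from_PART = 0
                         ):
    with open( file_with_URLs, "r" ) as f:
        lines = f.readlines()
    n = len( lines )
    iFROM = (   file_from_PART       * n ) // MAX_WORKERs #-- own-fraction start
    iUPTO = ( ( file_from_PART + 1 ) * n ) // MAX_WORKERs #-- own-fraction end
    for line in lines[iFROM:iUPTO]:
        try:
            url, name = line.strip().split( ";" )[:2]
            urllib.request.urlretrieve( url, name + ".html" )
        except Exception:
            pass #------------------ skip malformed lines & failed fetches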
If willing to tune-up the corner-cases, where some URL may take, and indeed takes, longer than the others, one may improve the form of the load-balancing by fair-queueing, yet at a cost of a more complex design ( many process-to-process messaging queues are available ), having a { __main__ | main() }-side FQ/LB-feeder and making the worker-processes retrieve their next task from such a job-request FQ/LB-facility. More complex & more robust to an uneven distribution of the URL-serving durations "across" the my_file-ordered list of URL-s to serve ( a sketch of one such option follows below ).
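A minimal sketch of such an FQ/LB-feeder, assuming the standard multiprocessing.Queue as the messaging tool ( just one of the many available options; the names queue_worker_FUN and main_FQLB are illustrative only ) :
import multiprocessing as mp
import urllib.request

def queue_worker_FUN( task_QUEUE ): #---- pulls jobs till a None-sentinel
    while True:
        task = task_QUEUE.get()
        if task is None: #--------------- sentinel: no more work to serve
            break
        url, name = task
        try:
            urllib.request.urlretrieve( url, name + ".html" )
        except Exception:
            pass #----------------------- skip failed fetches, as above

def main_FQLB( file_with_URLs, n_workers = 3 ):
    task_QUEUE = mp.Queue()
    workers = [ mp.Process( target = queue_worker_FUN,
                            args   = ( task_QUEUE, )
                            )
                for _ in range( n_workers )
                ]
    for w in workers: w.start()
    with open( file_with_URLs, "r" ) as f: # the { __main__ | main() }-side feeder
        for line in f:
            sl = line.strip().split( ";" )
            if len( sl ) >= 2:
                task_QUEUE.put( ( sl[0], sl[1] ) )
    for _ in workers: #------------------- one sentinel per worker-process
        task_QUEUE.put( None )
    for w in workers: w.join()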
The choices of the levels of the simplicity / complexity compromises, that impact the resulting performance / robustness, are yours.
For more details you may like to read this and the code from this and the there-directed examples or tips for further performance-boosting.
Upvotes: 1