Reputation: 169
I'm new to Python. I'm trying to implement a program that downloads a large number of files (>1000) from a remote server over http/https. To handle that volume, the code needs to use OS resources efficiently, so the approach I took is multiprocessing.
Here is my implementation:
import urllib, urlparse
import urllib2
import os
import logging
import multiprocessing
from multiprocessing.dummy import Pool as ThreadPool  # imported but unused below
from itertools import repeat

logger = logging.getLogger(__name__)  # logger was used below but never set up

def download_file((url, d_dir)):
    #logger.debug('Download URL -> ' + url)
    try:
        with open(d_dir + os.sep + urlparse.urlparse(url).path, 'wb') as tfile:
            tfile.write(urllib2.urlopen(url).read())
    except:
        logger.error('There was some problem while downloading file, ' + url)

def create_pool(d_links, d_dir):
    pool = multiprocessing.Pool(processes=10)
    pool.map(download_file, zip(d_links, repeat(d_dir)))

def extract_urls():
    # some logic to extract urls from files
    links = {'url1', 'url2', 'url3', 'url4', 'url5', …}
    # created process pool
    create_pool(links, l_dir)
If I run this code, it produces the expected output, but I don't think I've implemented the multiprocessing correctly. Can you please give some input on how to optimize this piece of code?
Thanks in advance.
Regards, Ashish
Upvotes: 0
Views: 967
Reputation: 1471
I had the same problem in Python 2.7. The issue is that the standard multiprocessing library's pool.map(func, arg) doesn't support passing more than one argument to the worker function. As a solution, I used the multiprocessing library from pathos.
So your call could look as follows:
import logging
from itertools import repeat, izip
from pathos.multiprocessing import ProcessingPool as Pool

p = Pool(self.nbr_processes)
try:
    p.map(download_file, izip(d_links, repeat(d_dir)))
    p.close()
    p.join()
except Exception as f:
    logging.error(f)
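If you would rather stay on the standard library, functools.partial can bind the fixed d_dir argument so that pool.map only has to supply a single iterable. A minimal sketch, assuming download_file is rewritten to take d_dir and url as two plain parameters (partials of module-level functions should pickle on CPython 2.7+; if not, the tuple approach from the question is the usual fallback):

import os
import functools
import urllib2
import urlparse
import multiprocessing

def download_file(d_dir, url):
    # same body as in the question; d_dir comes first so that
    # functools.partial can bind it below
    with open(d_dir + os.sep + urlparse.urlparse(url).path, 'wb') as tfile:
        tfile.write(urllib2.urlopen(url).read())

def create_pool(d_links, d_dir):
    pool = multiprocessing.Pool(processes=10)
    # partial() fixes d_dir, so map() only has to supply each url
    pool.map(functools.partial(download_file, d_dir), d_links)
    pool.close()
    pool.join()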
Upvotes: 0
Reputation: 306
You may do this (Python 3; Pool became a context manager in 3.3):
import multiprocessing as mp
from itertools import repeat

with mp.Pool(4) as pool:
    result = pool.map_async(download_file, zip(d_links, repeat(d_dir)))
    result.get()  # wait for completion; exiting the with block terminates the pool
Reference: https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool
Note that map_async submits the jobs and returns immediately, while map blocks until all the results are ready; both run the calls in parallel across the pool. The result.get() above is what actually waits for the downloads here, since exiting the with block terminates the pool.
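Also, since this workload is network-bound rather than CPU-bound, a thread pool is often the simpler fit; the question already imports multiprocessing.dummy, which exposes the same Pool API backed by threads. A minimal sketch, reusing download_file, d_links, and d_dir from the question:

from multiprocessing.dummy import Pool as ThreadPool  # same Pool API, backed by threads
from itertools import repeat

# Threads skip pickling and process start-up cost; that is fine here
# because the workers mostly wait on network I/O rather than the CPU.
pool = ThreadPool(10)
try:
    pool.map(download_file, zip(d_links, repeat(d_dir)))
finally:
    pool.close()
    pool.join()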
Upvotes: 2