Ashish Mishra

Reputation: 169

Optimizing multithreading in Python to download files

I'm new to Python. I'm trying to implement a program that downloads a large number of files (>1000) from a remote server over http/https. To handle this, the code needs to use OS resources efficiently, so the approach I took is multiprocessing.

Here, is my implementation :

import os
import logging
import urlparse
import urllib2
import multiprocessing
from itertools import repeat

logger = logging.getLogger(__name__)

def download_file((url, d_dir)) :
    try :
        # urlparse(url).path starts with '/', so keep only the basename
        f_name = os.path.basename(urlparse.urlparse(url).path)
        with open(os.path.join(d_dir, f_name), 'wb') as tfile :
            tfile.write(urllib2.urlopen(url).read())

    except Exception :
        logger.error('There was a problem while downloading file %s', url)


def create_pool(d_links, d_dir) :
    pool = multiprocessing.Pool(processes=10)
    pool.map(download_file, zip(d_links, repeat(d_dir)))
    pool.close()
    pool.join()

def extract_urls() :
    # some logic to extract urls from files
    links = ['url1', 'url2', 'url3', 'url4', 'url5']  # ...

    # create the process pool
    create_pool(links, l_dir)

If I run this code, it produces the expected output, but I don't think I have implemented multiprocessing correctly. Can you please give some input on how to optimize this piece of code?
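For context, since downloading is I/O-bound rather than CPU-bound, I also considered a thread-based variant on Python 3. This is only a sketch (helper names like local_path and download_all are my own), in case it helps frame the question:

```python
import os
import logging
from urllib.parse import urlparse
from urllib.request import urlopen
from concurrent.futures import ThreadPoolExecutor

def local_path(url, d_dir):
    # urlparse(url).path begins with '/', so keep only the basename
    return os.path.join(d_dir, os.path.basename(urlparse(url).path))

def download_file(url, d_dir):
    try:
        with urlopen(url) as resp, open(local_path(url, d_dir), 'wb') as tfile:
            tfile.write(resp.read())
    except Exception:
        logging.exception('Problem while downloading %s', url)

def download_all(d_links, d_dir, workers=10):
    # threads share memory and avoid process start-up cost for I/O work
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces the lazy map, so all downloads finish here
        list(pool.map(lambda url: download_file(url, d_dir), d_links))
```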

Thanks in advance.

Regards, Ashish

Upvotes: 0

Views: 967

Answers (2)

sdikby

Reputation: 1471

I had the same problem in Python 2.7. The issue is that the multiprocessing library's pool.map(func, arg) doesn't support more than one argument. As a workaround I used the multiprocessing library from pathos. So your call could be as follows:

from pathos.multiprocessing import ProcessingPool as Pool
from itertools import izip, repeat

p = Pool(self.nbr_processes)
try:
    p.map(download_file, izip(d_links, repeat(d_dir)))
    p.close()
    p.join()
except Exception as f:
    logging.error(f)
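If adding pathos as a dependency is not an option, the standard library alone can do it: functools.partial binds d_dir in advance, so the stock pool.map only has to supply the single url argument. A minimal sketch (the stub body of download_file is mine, just for illustration):

```python
from functools import partial
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed pool

def download_file(url, d_dir):
    # stub body for illustration; the real version would fetch url into d_dir
    return '%s -> %s' % (url, d_dir)

def create_pool(d_links, d_dir):
    pool = ThreadPool(10)
    try:
        # partial(...) fixes d_dir, so map supplies url alone from d_links
        return pool.map(partial(download_file, d_dir=d_dir), d_links)
    finally:
        pool.close()
        pool.join()
```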

Upvotes: 0

Matthias Gilch

Reputation: 306

You may do this:

import multiprocessing as mp
from itertools import repeat

with mp.Pool(4) as pool:
    res = pool.map_async(download_file, zip(d_links, repeat(d_dir)))
    res.get()  # wait for the downloads before the pool is terminated

Reference: https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool

Note that map_async dispatches the jobs and returns an AsyncResult immediately (collect the results with .get()), whereas map blocks until every call has returned.
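To see the difference in isolation (a small self-contained example, not the download code): map_async returns at once, and inside a with-block you must collect the result before the pool is terminated.

```python
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == '__main__':
    with mp.Pool(2) as pool:
        res = pool.map_async(square, [1, 2, 3])  # returns immediately
        results = res.get()                      # blocks here, like map would
    print(results)  # [1, 4, 9]
```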

Upvotes: 2
