Michael

Reputation: 16132

Multiprocessing slows down my web crawler?

I want to download 20 CSV files that together are about 5 MB in size.
Here is the first version of my code:

import os
from bs4 import BeautifulSoup
import urllib.request
import datetime

def get_page(url):
    try:
        return urllib.request.urlopen(url).read()
    except Exception:
        print("[warn] %s" % (url))
        raise

def get_all_links(page):
    # Return the first link on the page that points to a CSV file.
    soup = BeautifulSoup(page, 'html.parser')
    for link in soup.find_all('a'):
        url = link.get('href')
        if url and '.csv' in url:
            return url
    print("[warn] Can't find a link with a CSV file!")

def get_csv_file(company):
    # Build the historical-prices URL for the given ticker by swapping
    # the placeholder symbol (AAPL) for the requested one.
    link = 'http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices'
    g = link.find('s=')
    name = link[g + 2:g + 6]
    link = link.replace(name, company)
    # Find the CSV link on that page and save it as prices/<ticker>.csv.
    urllib.request.urlretrieve(get_all_links(get_page(link)), os.path.join('prices', company + '.csv'))
    print("[info][" + company + "] Download is complete!")

if __name__ == "__main__":
    start = datetime.datetime.now()
    security_list = ["AAPL", "ADBE", "AMD", "AMZN", "CRM", "EXPE", "FB", "GOOG", "GRPN", "INTC", "LNKD", "MCD", "MSFT", "NFLX", "NVDA", "NVTL", "ORCL", "SBUX", "STX"]
    for security in security_list:
        get_csv_file(security)

    end = datetime.datetime.now()
    print('[success] Total time: ' + str(end-start))

This code downloads the 20 CSV files (about 5 MB in total) in roughly 1.2 minutes.
Then I tried to use multiprocessing to make the downloads faster.
Here is version 2:

if __name__ == "__main__":
    import multiprocessing
    start = datetime.datetime.now()

    security_list = ["AAPL", "ADBE", "AMD", "AMZN", "CRM", "EXPE", "FB", "GOOG", "GRPN", "INTC", "LNKD", "MCD", "MSFT", "NFLX", "NVDA", "NVTL", "ORCL", "SBUX", "STX"]
    for i in range(20):
        p = multiprocessing.Process(target=hP.get_csv_files([index] + security_list), args=(i,))
        p.start()

    end = datetime.datetime.now()
    print('[success] Total time: ' + str(end-start))

But, unfortunately, version 2 downloads the same 20 CSV files (about 5 MB in total) in 2.4 minutes.

Why does multiprocessing slow down my program?
What am I doing wrong?
What is the best way to download these files faster?

Thank you!

Upvotes: 0

Views: 302

Answers (1)

rolisz

Reputation: 11832

I don't know exactly what you are trying to start with Process in your example (I think you have a few typos). I think you want something like this:

processes = []
for security in security_list:
    p = multiprocessing.Process(target=get_csv_file, args=(security,))
    p.start()
    processes.append(p)

for p in processes:
    p.join()

This way you iterate over the security list, create a new process for each security name, and put each process in a list.

After you have started all the processes, you loop over them and wait for each one to finish using join.
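
For completeness, here is a minimal sketch of how that fits into your script, assuming the get_csv_file function and security_list from your version 1 are defined in the same module (the if __name__ == "__main__": guard matters because multiprocessing may start a fresh interpreter on some platforms, such as Windows):

import datetime
import multiprocessing

if __name__ == "__main__":
    start = datetime.datetime.now()

    security_list = ["AAPL", "ADBE", "AMD", "AMZN"]  # shortened for the example

    # Start one process per ticker.
    processes = []
    for security in security_list:
        p = multiprocessing.Process(target=get_csv_file, args=(security,))
        p.start()
        processes.append(p)

    # Wait for every download to finish before measuring the elapsed time.
    for p in processes:
        p.join()

    end = datetime.datetime.now()
    print('[success] Total time: ' + str(end - start))

Note that without the join loop, the timing at the end would be printed while the downloads are still running in the child processes.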

There is also a simpler way to do this, using Pool and its parallel map implementation.

pool = multiprocessing.Pool(processes=5)
pool.map(get_csv_file, security_list)

You create a Pool of processes (if you omit the argument, it will create a number equal to your processor count), and then you apply your function to each element in the list using map. The pool will take care of the rest.
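
Here is a minimal sketch of the Pool version, again assuming get_csv_file and security_list from version 1 are defined in the same module (the with statement makes sure the worker processes are cleaned up once the map finishes):

import multiprocessing

if __name__ == "__main__":
    security_list = ["AAPL", "ADBE", "AMD", "AMZN"]  # shortened for the example

    # Five workers is plenty here: the job is network-bound, not CPU-bound.
    with multiprocessing.Pool(processes=5) as pool:
        # map blocks until every get_csv_file call has returned.
        pool.map(get_csv_file, security_list)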

Upvotes: 4
