DanEng

Reputation: 428

Multiprocessing in Python: web scraping doesn't speed up

I would like to use the multiprocessing module to speed up web scraping. My goal is to extract a part of the HTML from each page, collect the results in a variable in the parent process, and finally write that variable to a file.

The problem is that it takes around 1 second to process each page.

My code works, but it does not do what I want:

import urllib.request
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count


def parseWeb(url):
    url = url.strip()  # drop the trailing newline from each line of the file
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')  # specify a parser explicitly to avoid a warning
    h2_tag = soup.find('h2', class_='midashigo')
    return h2_tag

if __name__ == '__main__':
    file = 'links.txt' # each link is on a separate line.
    pool = Pool(cpu_count() * 2)
    with open(file, 'r') as f:
        results = pool.map(parseWeb, f)
    with open('output.txt', 'w', encoding='utf-8') as w:
        w.write(str(results))

How can it be modified to give it the full power of multiprocessing? Thank you.

Upvotes: 1

Views: 4064

Answers (1)

Leon

Reputation: 12491

This process should be I/O bound, meaning your bottleneck should be how fast you can pull data down the connection before parsing, but in practice it may turn out to be CPU or memory bound.

The first thing you need to realize is that multithreading/multiprocessing is not going to speed up the parsing of an individual page. So if one page takes one second and you have 420,000 pages, it will take 420,000 seconds of work. If you set the number of threads to twice the number of cores and your PC has 4 cores, you get 8 threads each spending 1 second per page. You still end up with 420,000 / 8 seconds, which is 875 minutes, or about 14.5 hours of processing (in practice it will not divide quite that cleanly)....

For the time spans to be manageable you will need about 400 threads, which would bring the theoretical processing time down to roughly 17.5 minutes.
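A quick back-of-the-envelope check of those numbers (a minimal sketch; the 1 second per page and 420,000 pages are the figures from the question):

pages = 420_000
seconds_per_page = 1.0

# For an I/O-bound job, wall-clock time is roughly
# (pages * seconds_per_page) / concurrent_workers.
for workers in (1, 8, 400):
    total_seconds = pages * seconds_per_page / workers
    print(f"{workers:>4} workers -> {total_seconds / 60:8.1f} minutes")

# prints:
#    1 workers ->   7000.0 minutes
#    8 workers ->    875.0 minutes
#  400 workers ->     17.5 minutes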

With so many threads running and pages being parsed, memory is going to become a problem as well.

I slapped together this little app to test some timings:

from time import sleep

from multiprocessing.dummy import Pool
from multiprocessing import cpu_count


def f(x):
    sleep(1)  # simulate the ~1 second it takes to fetch and parse one page
    x = int(x)
    return x * x

if __name__ == '__main__':
    pool = Pool(cpu_count() * 100)

    with open('input.txt', 'r') as i:
        results = pool.map(f, i)
    with open('output.txt', 'w') as w:
        w.write(str(results))

With an input file of the numbers 1 to 420,000, processing took 1053.39 seconds (roughly 17.5 minutes), but this is not a good indicator of how long your job will take, since with the memory and I/O issues mentioned above you could end up with something significantly slower.

The bottom line is: if you are not maxing out your CPU, RAM or network I/O, then your thread pool is too small.
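Applied to the code in the question, that mainly means raising the pool size well beyond cpu_count() * 2. A minimal sketch of what that could look like (the pool size of 400 and the 'html.parser' choice are assumptions, tune them to your machine and network):

import urllib.request
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # thread-based pool, fine for I/O-bound work


def parseWeb(url):
    url = url.strip()  # remove the trailing newline from each line of links.txt
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    h2_tag = soup.find('h2', class_='midashigo')
    return str(h2_tag)  # return plain text so results stay small in memory

if __name__ == '__main__':
    # Assumed pool size; for I/O-bound work it can far exceed the core count.
    pool = Pool(400)
    with open('links.txt', 'r') as f:
        results = pool.map(parseWeb, f)
    with open('output.txt', 'w', encoding='utf-8') as w:
        w.write('\n'.join(results))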

Upvotes: 7
