DanEng

Reputation: 428

Multiprocessing in Python: web scraping doesn't speed up

I would like to use the multiprocessing module to speed up web scraping. My goal is to extract a part of the HTML from each page, collect the results in a variable in the parent process, and finally write that variable to a file.

The problem is that it takes around 1 second to process each page.

My code works, but it does not do what I want:

import urllib.request
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count


def parseWeb(url):
    url = url.strip()  # drop the trailing newline from each line of the file
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')  # specify a parser explicitly to avoid a warning
    h2_tag = soup.find('h2', class_='midashigo')
    return h2_tag

if __name__ == '__main__':
    file = 'links.txt' # each link is on a separate line.
    pool = Pool(cpu_count() * 2)
    with open(file, 'r') as f:
        results = pool.map(parseWeb, f)
    with open('output.txt', 'w', encoding='utf-8') as w:
        w.write(str(results))

How can it be modified to give it the full power of multiprocessing? Thank you.

Upvotes: 1

Views: 4064

Answers (1)

Leon

Reputation: 12491

This process should be I/O bound, meaning your bottleneck should be how fast you can pull data down the connection before parsing, but in practice it may turn out to be CPU or memory bound.

The first thing you need to realize is that multithreading/multiprocessing is not going to speed up the parsing of an individual page. So if one page takes one second and you have 420,000 pages, it will take 420,000 seconds of work. If you set the number of threads to twice the number of cores and your PC has 4 cores, you get 8 threads each spending 1 second per page. You still end up with 420,000 / 8 seconds, which is 875 minutes, or about 14.5 hours of processing (in practice it will not divide quite that cleanly)....

For the time spans to be manageable you will need about 400 threads, which would bring the theoretical processing time down to roughly 17.5 minutes.
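A quick back-of-the-envelope check of those numbers (a minimal sketch; the 1 second per page and 420,000 pages are the figures from the question):

pages = 420_000
seconds_per_page = 1.0

# For an I/O-bound job, wall-clock time is roughly
# (pages * seconds_per_page) / concurrent_workers.
for workers in (1, 8, 400):
    total_seconds = pages * seconds_per_page / workers
    print(f"{workers:>4} workers -> {total_seconds / 60:8.1f} minutes")

# prints:
#    1 workers ->   7000.0 minutes
#    8 workers ->    875.0 minutes
#  400 workers ->     17.5 minutes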

With so many threads running and pages being parsed, memory is going to become a problem as well.

I slapped together this little app to test some timings:

from time import sleep

from multiprocessing.dummy import Pool
from multiprocessing import cpu_count


def f(x):
    sleep(1)  # simulate the ~1 second it takes to fetch and parse one page
    x = int(x)
    return x * x

if __name__ == '__main__':
    pool = Pool(cpu_count() * 100)

    with open('input.txt', 'r') as i:
        results = pool.map(f, i)
    with open('output.txt', 'w') as w:
        w.write(str(results))

With an input file of the numbers 1 to 420,000, processing took 1053.39 seconds (roughly 17.5 minutes), but this is not a good indicator of how long your job will take, since with the memory and I/O issues mentioned above you could end up with something significantly slower.

The bottom line is: if you are not maxing out your CPU, RAM or network I/O, then your thread pool is too small.
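Applied to the code in the question, that mainly means raising the pool size well beyond cpu_count() * 2. A minimal sketch of what that could look like (the pool size of 400 and the 'html.parser' choice are assumptions, tune them to your machine and network):

import urllib.request
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # thread-based pool, fine for I/O-bound work


def parseWeb(url):
    url = url.strip()  # remove the trailing newline from each line of links.txt
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    h2_tag = soup.find('h2', class_='midashigo')
    return str(h2_tag)  # return plain text so results stay small in memory

if __name__ == '__main__':
    # Assumed pool size; for I/O-bound work it can far exceed the core count.
    pool = Pool(400)
    with open('links.txt', 'r') as f:
        results = pool.map(parseWeb, f)
    with open('output.txt', 'w', encoding='utf-8') as w:
        w.write('\n'.join(results))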

Upvotes: 7
