Reputation: 496
I want to parse a website with multiple pages.
I don't know the number of pages. This is the original code:
import requests
from bs4 import BeautifulSoup

next_button = soup.find_all('a', {'class': "btn-page_nav right"})
while next_button:
    link = next_button[0]['href']
    resp = requests.get(webpage + link)   # webpage is the site's base URL
    soup = BeautifulSoup(resp.content, 'html.parser')
    table = soup.find('table', {'class': 'js-searchresults'})
    body = table.find('tbody')
    rows = body.find_all('tr')
    function(rows)
    next_button = soup.find_all('a', {'class': "btn-page_nav right"})
It works fine; function(rows)
is a function which parses part of each page.
What I want to do is use multiprocessing
to parse these pages. I thought about using a pool
of 3 workers so that I could process 3 pages at once, but I can't figure out how to implement it.
One solution is this:
rows_list = []
next_button = soup.find_all('a', {'class': "btn-page_nav right"})
while next_button:
    link = next_button[0]['href']
    resp = requests.get(webpage + link)
    soup = BeautifulSoup(resp.content, 'html.parser')
    table = soup.find('table', {'class': 'js-searchresults'})
    body = table.find('tbody')
    rows = body.find_all('tr')
    rows_list.append(rows)
    next_button = soup.find_all('a', {'class': "btn-page_nav right"})
Wait for the program to loop through all pages and then:
pool = multiprocessing.Pool(processes=4)
pool.map(function, rows_list)
But I don't think this will increase the performance much. I would like the main process to loop through the pages and, as soon as it opens a page, send it to a worker. How can this be done? A dummy example:
pool = multiprocessing.Pool(processes=4)
next_button = soup.find_all('a', {'class': "btn-page_nav right"})
while next_button:
    link = next_button[0]['href']
    resp = requests.get(webpage + link)
    soup = BeautifulSoup(resp.content, 'html.parser')
    table = soup.find('table', {'class': 'js-searchresults'})
    body = table.find('tbody')
    rows = body.find_all('tr')
    pool.send_to_idle_worker(rows)   # pseudocode: hand this page's rows to an idle worker
    next_button = soup.find_all('a', {'class': "btn-page_nav right"})
Upvotes: 0
Views: 274
Reputation: 12205
Can you use Pool.apply_async()
instead of Pool.map()
? apply_async() does not block, so your main program can keep on processing more rows. It also does not require your main program to have all the data ready to be mapped; you would just pass one chunk at a time as a parameter to apply_async()
.
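A minimal sketch of how that could slot into your existing loop (assuming webpage, soup, and function are defined as in your question; the 'html.parser' argument and the 3-worker pool size are my assumptions, and the rows you pass to apply_async() must be picklable, so if that fails you could pass str(table) instead and re-parse it inside function):
import multiprocessing

import requests
from bs4 import BeautifulSoup

pool = multiprocessing.Pool(processes=3)   # 3 workers, as in the question
async_results = []

next_button = soup.find_all('a', {'class': "btn-page_nav right"})
while next_button:
    link = next_button[0]['href']
    resp = requests.get(webpage + link)
    soup = BeautifulSoup(resp.content, 'html.parser')
    table = soup.find('table', {'class': 'js-searchresults'})
    body = table.find('tbody')
    rows = body.find_all('tr')

    # apply_async() returns immediately, so the main loop can fetch the
    # next page while an idle worker runs function(rows) on this one.
    async_results.append(pool.apply_async(function, (rows,)))

    next_button = soup.find_all('a', {'class': "btn-page_nav right"})

pool.close()
pool.join()   # wait for the remaining pages to finish

# Optional: calling .get() re-raises any exception a worker hit.
for r in async_results:
    r.get()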
Upvotes: 1