Reputation: 496
I want to parse a website with multiple pages.
I don't know the number of pages. This is the original code:
import requests
from bs4 import BeautifulSoup

next_button = soup.find_all('a', {'class': "btn-page_nav right"})
while next_button:
    link = next_button[0]['href']
    resp = requests.get(webpage + link)   # webpage is the site's base URL
    soup = BeautifulSoup(resp.content, 'html.parser')
    table = soup.find('table', {'class': 'js-searchresults'})
    body = table.find('tbody')
    rows = body.find_all('tr')
    function(rows)
    next_button = soup.find_all('a', {'class': "btn-page_nav right"})
It works fine; function(rows)
is a function which parses part of each page.
What I want to do is use multiprocessing
to parse these pages. I thought about using a pool
of 3 workers so that I could process 3 pages at once, but I can't figure out how to implement it.
One solution is this:
rows_list = []
next_button = soup.find_all('a', {'class': "btn-page_nav right"})
while next_button:
    link = next_button[0]['href']
    resp = requests.get(webpage + link)
    soup = BeautifulSoup(resp.content, 'html.parser')
    table = soup.find('table', {'class': 'js-searchresults'})
    body = table.find('tbody')
    rows = body.find_all('tr')
    rows_list.append(rows)
    next_button = soup.find_all('a', {'class': "btn-page_nav right"})
Wait for the program to loop through all pages and then:
pool = multiprocessing.Pool(processes=4)
pool.map(function, rows_list)
But I don't think this will increase the performance much. I would like the main process to loop through the pages and, as soon as it opens a page, send it to a worker. How can this be done? A dummy example:
pool = multiprocessing.Pool(processes=4)
next_button = soup.find_all('a', {'class': "btn-page_nav right"})
while next_button:
    link = next_button[0]['href']
    resp = requests.get(webpage + link)
    soup = BeautifulSoup(resp.content, 'html.parser')
    table = soup.find('table', {'class': 'js-searchresults'})
    body = table.find('tbody')
    rows = body.find_all('tr')
    pool.send_to_idle_worker(rows)   # pseudocode: hand this page's rows to an idle worker
    next_button = soup.find_all('a', {'class': "btn-page_nav right"})
Upvotes: 0
Views: 274
Reputation: 12205
Can you use Pool.apply_async()
instead of Pool.map()
? apply_async() does not block, so your main program can keep on processing more rows. It also does not require your main program to have all the data ready to be mapped; you would just pass one chunk at a time as a parameter to apply_async()
.
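A minimal sketch of how that could slot into your existing loop (assuming webpage, soup, and function are defined as in your question; the 'html.parser' argument and the 3-worker pool size are my assumptions, and the rows you pass to apply_async() must be picklable, so if that fails you could pass str(table) instead and re-parse it inside function):
import multiprocessing

import requests
from bs4 import BeautifulSoup

pool = multiprocessing.Pool(processes=3)   # 3 workers, as in the question
async_results = []

next_button = soup.find_all('a', {'class': "btn-page_nav right"})
while next_button:
    link = next_button[0]['href']
    resp = requests.get(webpage + link)
    soup = BeautifulSoup(resp.content, 'html.parser')
    table = soup.find('table', {'class': 'js-searchresults'})
    body = table.find('tbody')
    rows = body.find_all('tr')

    # apply_async() returns immediately, so the main loop can fetch the
    # next page while an idle worker runs function(rows) on this one.
    async_results.append(pool.apply_async(function, (rows,)))

    next_button = soup.find_all('a', {'class': "btn-page_nav right"})

pool.close()
pool.join()   # wait for the remaining pages to finish

# Optional: calling .get() re-raises any exception a worker hit.
for r in async_results:
    r.get()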
Upvotes: 1