Reputation: 73
I have a simple web scraping script that uses multiprocessing. I want the user to choose which Excel file is being scraped, so input() is used at the beginning.
Without the multiprocessing code, the script runs fine (though it processes links one at a time). With the multiprocessing code, the script hangs indefinitely. This is true even if I don't use the string collected from input() anywhere in the script, so it seems the mere presence of input() is what causes the script to hang when multiprocessing is involved.
I haven't got a clue why this would be the case. Any insight is really appreciated.
The code:
os.chdir(os.path.curdir)

# excel_file_name_b is not used in the script at all, but because
# it exists, the script hangs. Ideally I want to keep input() in the script
excel_file_name_b = input()

excel_file_name = "URLs.xlsx"
excel_file = openpyxl.load_workbook(excel_file_name)
active_sheet = excel_file.active
rows = active_sheet.max_row

for i in range(2, rows + 1, 1):
    list.append(active_sheet.cell(row=i, column=1).value)

headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
           "Accept-Language": 'en-GB'}

def scrape(url):
    try:
        res = get(url, headers=headers)
        html_soup = BeautifulSoup(res.text, 'lxml')
        html_element = html_soup.select('._3pvwV0k')
        return res.url, html_element[0].getText()
    except:
        return res.url, "Not found or error"
        pass

if __name__ == '__main__':
    p = Pool(10)
    scrape_return = p.map(scrape, list)

    for k in range(len(scrape_return)):
        try:
            active_sheet.cell(row=k + 2, column=2).value = scrape_return[k][0]
            active_sheet.cell(row=k + 2, column=3).value = scrape_return[k][1]
        except:
            continue

    excel_file.save(excel_file_name)
Upvotes: 1
Views: 116
Reputation: 4489
Because your input() is at module level, each worker process runs it when it imports your module, so that the value would be available to that process. Multiprocessing closes stdin in the workers, which is what causes the hang you're seeing. [docs]
If you move it inside if __name__ == '__main__': you shouldn't have the problem any more.
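To see the difference, here is a minimal sketch (a toy example, not your script) showing that module-level code re-runs in every spawned worker on Windows, while code under the __main__ guard runs only once in the parent:

from multiprocessing import Pool

# Module level: on spawn-based platforms (e.g. Windows) every worker
# re-imports this module, so this line runs in the parent AND each worker.
print("module level")

def square(x):
    return x * x

if __name__ == '__main__':
    # Runs only in the parent process.
    print("__main__ guard")
    with Pool(2) as p:
        print(p.map(square, [1, 2, 3]))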
Edit: restructuring your code along the lines below will probably also clear up the other issues that keep it from performing as expected.
import openpyxl
from bs4 import BeautifulSoup
from multiprocessing import Pool
from requests import get

def scrape(url):
    headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
               "Accept-Language": 'en-GB'}
    try:
        res = get(url, headers=headers)
        html_soup = BeautifulSoup(res.text, 'lxml')
        html_element = html_soup.select('._3pvwV0k')
        return res.url, html_element[0].getText()
    except Exception:
        # res may not exist if get() itself failed, so fall back to url
        return url, "Not found or error"

def main():
    # input() now only runs in the parent process, not in the workers
    excel_file_name_b = input()
    excel_file_name = "URLs.xlsx"
    excel_file = openpyxl.load_workbook(excel_file_name)
    active_sheet = excel_file.active
    rows = active_sheet.max_row

    urls = []  # renamed: don't shadow the built-in name list
    for i in range(2, rows + 1):
        urls.append(active_sheet.cell(row=i, column=1).value)

    p = Pool(10)
    scrape_return = p.map(scrape, urls)

    for k in range(len(scrape_return)):
        try:
            active_sheet.cell(row=k + 2, column=2).value = scrape_return[k][0]
            active_sheet.cell(row=k + 2, column=3).value = scrape_return[k][1]
        except Exception:
            continue

    excel_file.save(excel_file_name)

if __name__ == '__main__':
    main()
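As a side note (not required for the fix): on Python 3.3+ the Pool can be used as a context manager so it is closed and joined automatically, and enumerate() reads a little better than indexing with range(len(...)). A sketch of that part of main(), assuming the scrape() function and worksheet above (write_results is just a hypothetical helper name):

from multiprocessing import Pool

def write_results(active_sheet, urls):
    # The with-block closes and joins the pool even if an error occurs.
    with Pool(10) as p:
        scrape_return = p.map(scrape, urls)

    # enumerate() gives the row index and the (url, text) pair together.
    for k, (url, text) in enumerate(scrape_return):
        active_sheet.cell(row=k + 2, column=2).value = url
        active_sheet.cell(row=k + 2, column=3).value = text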
Upvotes: 2