Reputation: 73
I have a simple web scraping script that uses multiprocessing. I want the user to choose which Excel file is being scraped, so input() is used at the beginning.
Without the multiprocessing code, the script runs fine (though it processes links one at a time). With the multiprocessing code, the script hangs indefinitely. This is true even if I don't use the string collected from input() anywhere in the script, so it seems the mere presence of input() is what causes the script to hang when multiprocessing is involved.
I haven't got a clue why this would be the case. Any insight is really appreciated.
The code:
os.chdir(os.path.curdir)

# excel_file_name_b is not used in the script at all, but because
# it exists, the script hangs. Ideally I want to keep input() in the script
excel_file_name_b = input()

excel_file_name = "URLs.xlsx"
excel_file = openpyxl.load_workbook(excel_file_name)
active_sheet = excel_file.active
rows = active_sheet.max_row

for i in range(2, rows + 1, 1):
    list.append(active_sheet.cell(row=i, column=1).value)

headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
           "Accept-Language": 'en-GB'}

def scrape(url):
    try:
        res = get(url, headers=headers)
        html_soup = BeautifulSoup(res.text, 'lxml')
        html_element = html_soup.select('._3pvwV0k')
        return res.url, html_element[0].getText()
    except:
        return res.url, "Not found or error"
        pass

if __name__ == '__main__':
    p = Pool(10)
    scrape_return = p.map(scrape, list)

    for k in range(len(scrape_return)):
        try:
            active_sheet.cell(row=k + 2, column=2).value = scrape_return[k][0]
            active_sheet.cell(row=k + 2, column=3).value = scrape_return[k][1]
        except:
            continue

    excel_file.save(excel_file_name)
Upvotes: 1
Views: 116
Reputation: 4489
Because your input() is at module level, each worker process runs it when it imports your module, so that the value would be available to that process. Multiprocessing closes stdin in the workers, which is what causes the hang you're seeing. [docs]
If you move it inside if __name__ == '__main__': you shouldn't have the problem any more.
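To see the difference, here is a minimal sketch (a toy example, not your script) showing that module-level code re-runs in every spawned worker on Windows, while code under the __main__ guard runs only once in the parent:

from multiprocessing import Pool

# Module level: on spawn-based platforms (e.g. Windows) every worker
# re-imports this module, so this line runs in the parent AND each worker.
print("module level")

def square(x):
    return x * x

if __name__ == '__main__':
    # Runs only in the parent process.
    print("__main__ guard")
    with Pool(2) as p:
        print(p.map(square, [1, 2, 3]))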
Edit: restructuring your code along the lines below will probably also clear up the other issues that keep it from performing as expected.
import openpyxl
from bs4 import BeautifulSoup
from multiprocessing import Pool
from requests import get

def scrape(url):
    headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
               "Accept-Language": 'en-GB'}
    try:
        res = get(url, headers=headers)
        html_soup = BeautifulSoup(res.text, 'lxml')
        html_element = html_soup.select('._3pvwV0k')
        return res.url, html_element[0].getText()
    except Exception:
        # res may not exist if get() itself failed, so fall back to url
        return url, "Not found or error"

def main():
    # input() now only runs in the parent process, not in the workers
    excel_file_name_b = input()
    excel_file_name = "URLs.xlsx"
    excel_file = openpyxl.load_workbook(excel_file_name)
    active_sheet = excel_file.active
    rows = active_sheet.max_row

    urls = []  # renamed: don't shadow the built-in name list
    for i in range(2, rows + 1):
        urls.append(active_sheet.cell(row=i, column=1).value)

    p = Pool(10)
    scrape_return = p.map(scrape, urls)

    for k in range(len(scrape_return)):
        try:
            active_sheet.cell(row=k + 2, column=2).value = scrape_return[k][0]
            active_sheet.cell(row=k + 2, column=3).value = scrape_return[k][1]
        except Exception:
            continue

    excel_file.save(excel_file_name)

if __name__ == '__main__':
    main()
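As a side note (not required for the fix): on Python 3.3+ the Pool can be used as a context manager so it is closed and joined automatically, and enumerate() reads a little better than indexing with range(len(...)). A sketch of that part of main(), assuming the scrape() function and worksheet above (write_results is just a hypothetical helper name):

from multiprocessing import Pool

def write_results(active_sheet, urls):
    # The with-block closes and joins the pool even if an error occurs.
    with Pool(10) as p:
        scrape_return = p.map(scrape, urls)

    # enumerate() gives the row index and the (url, text) pair together.
    for k, (url, text) in enumerate(scrape_return):
        active_sheet.cell(row=k + 2, column=2).value = url
        active_sheet.cell(row=k + 2, column=3).value = text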
Upvotes: 2