Reputation: 89
I'm getting the following error while running my code:
...
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
...
I'm running the script in the following way:
if __name__ == '__main__':
    object_ = NameToLinkedInScraper(csv_file_name='people_small.csv', person_name_column='company', organization_name_column='email')
    object_.multiproc_job()
and the class:
class NameToLinkedInScraper:
    pool = Pool()
    proxy_list = PROXY_LIST
    csv_file_name = None
    person_name_column = None
    organization_name_column = None
    find_facebook = False
    find_twitter = False

    def __init__(self, csv_file_name, person_name_column, organization_name_column, find_facebook=False,
                 find_twitter=False):
        self.csv_file_name: str
        self.person_name_column: str
        self.organization_name_column: str
        self.df = pd.read_csv(csv_file_name)

    def internal_linkedin_job(self, _df):
        _df['linkedin_profile'] = np.nan
        _df['linkedin_profile'] = _df.apply(
            lambda row: term_scraper(
                str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                'link', output_generic=False), axis=1)

    def internal_generic_linkedin_job(self, _df):
        _df['linkedin_generic'] = np.nan
        _df['linkedin_generic'] = _df.apply(
            lambda row: term_scraper(
                str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                'link', output_generic=True), axis=1)

    def internal_facebook_twitter_job(self, _df):
        _df['title'] = np.nan
        _df['title'] = _df.apply(
            lambda row: term_scraper(
                str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                'title'), axis=1)
        if self.find_facebook:
            _df['facebook_profile'] = np.nan
            _df['facebook_profile'] = _df.apply(
                lambda row: term_scraper(
                    str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                    'link', output_generic=False, social_network='facebook'), axis=1)
        if self.find_twitter:
            _df['twitter_profile'] = np.nan
            _df['twitter_profile'] = _df.apply(
                lambda row: term_scraper(
                    str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                    'link', output_generic=False, social_network='twitter'), axis=1)

    def multiproc_job(self):
        linkedin_profile_proc = Process(target=self.internal_linkedin_job, args=self.df)
        linkedin_generic_profile_proc = Process(target=self.internal_generic_linkedin_job, args=self.df)
        internal_facebook_twitter_job = Process(target=self.internal_facebook_twitter_job, args=self.df)
        jobs = [linkedin_profile_proc, linkedin_generic_profile_proc, internal_facebook_twitter_job]
        for j in jobs:
            j.start()
        for j in jobs:
            j.join()
        self.df.to_csv(sys.path[0] + "\\" + self.csv_file_name + "_" + ".csv")
I can't put my finger on what is wrong. The script is running on Windows and I couldn't find an answer. I tried adding freeze_support() to the main block with no success, and also tried moving the process creation and job assignment out of the class and into the main block.
Upvotes: 0
Views: 129
Reputation: 11075
By creating the pool as a class attribute (`pool = Pool()`), it gets executed when `NameToLinkedInScraper` is defined during import (the "main" file is imported by children, so they have access to all the same classes and functions). If this were allowed, each child would recursively keep creating more children, who would then import the same file and create yet more children themselves. This is why spawning child processes is disabled during import of `__main__`. You should instead only call `Pool` in `__init__`, so new child processes are created only when you create an instance of your class. In general, class attributes should be avoided in favor of instance attributes unless the data is static, or needs to be shared between all instances of the class.
Upvotes: 2