Reputation: 89
I'm getting the following error while running my code:
...
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
...
I'm running the script in the following way:
if __name__ == '__main__':
    object_ = NameToLinkedInScraper(csv_file_name='people_small.csv', person_name_column='company', organization_name_column='email')
    object_.multiproc_job()
and the class:
class NameToLinkedInScraper:
    pool = Pool()
    proxy_list = PROXY_LIST
    csv_file_name = None
    person_name_column = None
    organization_name_column = None
    find_facebook = False
    find_twitter = False

    def __init__(self, csv_file_name, person_name_column, organization_name_column, find_facebook=False,
                 find_twitter=False):
        self.csv_file_name: str
        self.person_name_column: str
        self.organization_name_column: str
        self.df = pd.read_csv(csv_file_name)

    def internal_linkedin_job(self, _df):
        _df['linkedin_profile'] = np.nan
        _df['linkedin_profile'] = _df.apply(
            lambda row: term_scraper(
                str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                'link', output_generic=False), axis=1)

    def internal_generic_linkedin_job(self, _df):
        _df['linkedin_generic'] = np.nan
        _df['linkedin_generic'] = _df.apply(
            lambda row: term_scraper(
                str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                'link', output_generic=True), axis=1)

    def internal_facebook_twitter_job(self, _df):
        _df['title'] = np.nan
        _df['title'] = _df.apply(
            lambda row: term_scraper(
                str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                'title'), axis=1)
        if self.find_facebook:
            _df['facebook_profile'] = np.nan
            _df['facebook_profile'] = _df.apply(
                lambda row: term_scraper(
                    str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                    'link', output_generic=False, social_network='facebook'), axis=1)
        if self.find_twitter:
            _df['twitter_profile'] = np.nan
            _df['twitter_profile'] = _df.apply(
                lambda row: term_scraper(
                    str(row[self.person_name_column]) + " " + str(row[self.organization_name_column]), self.proxy_list,
                    'link', output_generic=False, social_network='twitter'), axis=1)

    def multiproc_job(self):
        linkedin_profile_proc = Process(target=self.internal_linkedin_job, args=self.df)
        linkedin_generic_profile_proc = Process(target=self.internal_generic_linkedin_job, args=self.df)
        internal_facebook_twitter_job = Process(target=self.internal_facebook_twitter_job, args=self.df)
        jobs = [linkedin_profile_proc, linkedin_generic_profile_proc, internal_facebook_twitter_job]
        for j in jobs:
            j.start()
        for j in jobs:
            j.join()
        self.df.to_csv(sys.path[0] + "\\" + self.csv_file_name + "_" + ".csv")
I can't put my finger on what is wrong. The script is running on Windows and I couldn't find an answer. I tried adding freeze_support() to the main block with no success, and also tried moving the process creation and job assignment out of the class and into the main block.
Upvotes: 0
Views: 129
Reputation: 11075
By creating the pool as a class attribute (`pool = Pool()`), it gets executed when `NameToLinkedInScraper` is defined during import (the "main" file is imported by children, so they have access to all the same classes and functions). If this were allowed, each child would recursively keep creating more children, who would then import the same file and create yet more children themselves. This is why spawning child processes is disabled during import of `__main__`. You should instead only call `Pool` in `__init__`, so new child processes are created only when you create an instance of your class. In general, class attributes should be avoided in favor of instance attributes unless the data is static, or needs to be shared between all instances of the class.
Upvotes: 2