Reputation: 704
I am trying to simplify some code. The process looks like this:
import pandas as pd
import requests
from bs4 import BeautifulSoup

list1 = []

def fun(df):
    for x in df['Col']:
        url = "my_website" + x
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        ...
        list1.append(data1)
    return list1

list1 = fun(my_df)
my_df['List1'] = list1
(I tried to keep the code as simple as possible.)
The output looks like this (the column Col comes from my initial dataframe, my_df):
Col     List1
mouse   [dog, horse, cat]
horse   [mouse, elephant]
tiger   []
Then I repeat the process for the strings in each row's list:
# 2nd round
list1 = []
my_df2 = my_df.explode('List1')
my_list2 = pd.Series(list(set(my_df2['List1']) - set(my_df['Col'])), name='Col')
new_df2 = pd.DataFrame(my_list2, columns=['Col'])
list1 = fun(new_df2)
new_df2['List1'] = list1
Now I have another dataframe with the new values, so I append these results to my original dataframe, my_df:
my_df2 = my_df.append(new_df2)
and I repeat the process again:
# 3rd round
list1 = []
my_df3 = my_df2.explode('List1')
my_list3 = pd.Series(list(set(my_df3['List1']) - set(my_df2['Col'])), name='Col')
new_df3 = pd.DataFrame(my_list3, columns=['Col'])
list1 = fun(new_df3)
new_df3['List1'] = list1
and so on, until I have finished scraping all the data.
Since I am repeating these 'rounds' manually every time, I would like to ask if there is a way to simplify the code and avoid all this awful repetition. Any tips would be appreciated.
EDIT: My difficulty is in setting up a condition where, if I have my original dataset (i.e. before the column List1 has been created), I first create the empty list list1 and then apply fun to my dataset.
In the other steps, I should:
If you need more information, I will be happy to provide it.
Upvotes: 0
Views: 647
Reputation: 261
As far as I can tell from my_df, the list1 declaration should be inside fun, or you're emptying it elsewhere.
First, I would change fun to work on only one entry (not the whole Series):
def fun(x):
    list1 = []
    url = "my_website" + x
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    ...
    list1.append(data1)
    return list1
Then you can do the first transformation (populating the second column, List1) with:
my_df['List1'] = my_df['Col'].apply(fun)
After that, you could do something like:
while scraping_to_do:
    newCol = pd.Series(list(set(my_df['List1']) - set(my_df['Col'])))
    newList1 = newCol.apply(fun)
    my_df = my_df.append(pd.DataFrame({'Col': newCol, 'List1': newList1}), ignore_index=True)
    my_df = my_df.explode('List1')
You need to figure out when to stop scraping (when the set difference is the empty set?), as well as deal with the NaNs that explode produces from empty lists.
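For what it's worth, here is a rough end-to-end sketch of that idea. It is only a sketch: it assumes the per-entry fun above, stops once the set difference is empty, and drops the NaN rows before computing the difference; adjust the stopping criterion to whatever actually fits your data.

my_df['List1'] = my_df['Col'].apply(fun)
my_df = my_df.explode('List1')

while True:
    # terms that appear in List1 but have not been scraped yet
    new_terms = set(my_df['List1'].dropna()) - set(my_df['Col'])
    if not new_terms:  # nothing left to scrape -> stop
        break
    new_df = pd.DataFrame({'Col': sorted(new_terms)})
    new_df['List1'] = new_df['Col'].apply(fun)
    # explode the new rows before appending so List1 holds single strings
    # (on newer pandas, use pd.concat([my_df, ...]) instead of append)
    my_df = my_df.append(new_df.explode('List1'), ignore_index=True)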
Upvotes: 1