Reputation: 704
I am trying to simplify some code. The process looks like this:
import pandas as pd
import requests
from bs4 import BeautifulSoup

list1 = []

def fun(df):
    for x in df['Col']:
        url = "my_website" + x
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        ...
        list1.append(data1)
    return list1

list1 = fun(my_df)
my_df['List1'] = list1
(I tried to keep the code as simple as possible.)
The output looks like this (the column Col comes from my initial dataframe, my_df):
Col     List1
mouse   [dog, horse, cat]
horse   [mouse, elephant]
tiger   []
Then I repeat the process for the strings in each row's list:
# 2nd round
list1 = []
my_df2 = my_df.explode('List1')
my_list2 = pd.Series(list(set(my_df2['List1']) - set(my_df['Col'])), name='Col')
new_df2 = pd.DataFrame(my_list2, columns=['Col'])
list1 = fun(new_df2)
new_df2['List1'] = list1
Now I have another dataframe with the new values, so I append these results to my original dataframe, my_df:
my_df2 = my_df.append(new_df2)
and I repeat the process again:
# 3rd round
list1 = []
my_df3 = my_df2.explode('List1')
my_list3 = pd.Series(list(set(my_df3['List1']) - set(my_df2['Col'])), name='Col')
new_df3 = pd.DataFrame(my_list3, columns=['Col'])
list1 = fun(new_df3)
new_df3['List1'] = list1
and so on, until I have finished scraping all the data.
Since I am repeating these 'rounds' manually every time, I would like to ask if there is a way to simplify the code and avoid all this awful repetition. Any tips would be appreciated.
EDIT: My difficulty is in setting up a condition where, if I have my original dataset (i.e. before the column List1 has been created), I first create the empty list list1 and then apply fun to my dataset.
In the other steps, I should:
If you need more information, I will be happy to provide it.
Upvotes: 0
Views: 647
Reputation: 261
As far as I can tell from my_df, the list1 declaration should be inside fun, or you're emptying it elsewhere.
First, I would change fun to work on only one entry (not the whole Series):
def fun(x):
    list1 = []
    url = "my_website" + x
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    ...
    list1.append(data1)
    return list1
Then you can do the first transformation (populating the second column, List1) with:
my_df['List1'] = my_df['Col'].apply(fun)
After that, you could do something like:
while scraping_to_do:
    newCol = pd.Series(list(set(my_df['List1']) - set(my_df['Col'])))
    newList1 = newCol.apply(fun)
    my_df = my_df.append(pd.DataFrame({'Col': newCol, 'List1': newList1}), ignore_index=True)
    my_df = my_df.explode('List1')
You need to figure out when to stop scraping (when the set difference is the empty set?), as well as deal with the NaNs that explode produces from empty lists.
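For what it's worth, here is a rough end-to-end sketch of that idea. It is only a sketch: it assumes the per-entry fun above, stops once the set difference is empty, and drops the NaN rows before computing the difference; adjust the stopping criterion to whatever actually fits your data.

my_df['List1'] = my_df['Col'].apply(fun)
my_df = my_df.explode('List1')

while True:
    # terms that appear in List1 but have not been scraped yet
    new_terms = set(my_df['List1'].dropna()) - set(my_df['Col'])
    if not new_terms:  # nothing left to scrape -> stop
        break
    new_df = pd.DataFrame({'Col': sorted(new_terms)})
    new_df['List1'] = new_df['Col'].apply(fun)
    # explode the new rows before appending so List1 holds single strings
    # (on newer pandas, use pd.concat([my_df, ...]) instead of append)
    my_df = my_df.append(new_df.explode('List1'), ignore_index=True)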
Upvotes: 1