Isa

Reputation: 145

Avoid duplicate words when scraping a web page (Python)

I scrape a webpage using keywords from a list (a column of my df converted to a list, which contains duplicate words) and write the results into a df. I need a way to skip duplicates when scraping (to save time), but at the same time I need the result value filled in for every duplicate word. Example:

my_column          result
string1            Yes
string2            No
string3            Yes
string2            No
string1            Yes
string4            No

This is obtained by using the keywords from my_column one by one, without skipping duplicates. Is there a way to scrape only the first occurrence of each duplicated keyword, but still fill in the result column for every keyword?

This is my code

for keyword in final_list:
    for index, row in data_splitted2.iterrows():
        if keyword == row['my_column']:
            if keyword is None:
                break
            # print(keyword)

            link = website + 'search/q?name=' + keyword
            driver.get(link)
            time.sleep(5)

            try:
                status = driver.find_element_by_class_name("yyyyy")
                # iterrows() yields copies, so write back through .loc
                data_splitted2.loc[index, 'result'] = status.text
            except NoSuchElementException:
                pass

One last note: I need to keep the duplicate keywords in my final df, so they should be skipped when scraping but still present in the final df.

Many thanks in advance


Upvotes: 2

Views: 151

Answers (1)

Jack Fleeting

Reputation: 24930

If I understand you correctly, you are probably looking for something like the below. It's very simplified, just to respond to the particular issue:

Let's assume you have this dataframe:

import random
import pandas as pd

data = ['string1', 'string2', 'string3', 'string2', 'string1', 'string4']
result = [''] * len(data)

df = pd.DataFrame({'my_column': data, 'result': result})

We can skip the duplicates in performing operations, but assign the results of these operations to all rows, including the duplicates:

for val in df.my_column.unique():
    state = "Yes" if random.randint(1,2)==1 else "No"
    #in your actual code, the above line will probably have to be replaced with status.text
    df.loc[df['my_column'] == val, 'result'] = state
df

Random output:

   my_column result
0    string1     No
1    string2    Yes
2    string3    Yes
3    string2    Yes
4    string1     No
5    string4     No
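
To connect this to the scraping loop in the question, one possible sketch is to look up each unique keyword once, cache the results in a dict, and then map them back onto every row. Note that `fetch_status` is a hypothetical stand-in for the Selenium lookup (`driver.get(...)` followed by `status.text`); it is not part of the original code, and here it only records its calls so the deduplication is visible:

```python
import pandas as pd

df = pd.DataFrame({"my_column": ["string1", "string2", "string3",
                                 "string2", "string1", "string4", None]})

calls = []  # records which keywords were actually "scraped"

def fetch_status(keyword):
    """Hypothetical placeholder for the real page lookup."""
    calls.append(keyword)
    return "Yes" if keyword in ("string1", "string3") else "No"

# Look up each distinct, non-null keyword exactly once.
cache = {kw: fetch_status(kw) for kw in df["my_column"].dropna().unique()}

# Broadcast the cached results to all rows, duplicates included.
# Rows whose keyword is missing get NaN in the result column.
df["result"] = df["my_column"].map(cache)
```

With this layout the expensive call runs four times for the seven rows, and the duplicate rows stay in the dataframe. In the real script, the body of `fetch_status` would hold the page load, the wait, and the `NoSuchElementException` handling.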

Upvotes: 1
