Reputation: 145
I scrape a webpage using keywords taken from a list (a column of my df converted to a list, which contains duplicate words) and write the results back into a df. I need a way to skip duplicates when scraping (to save time) while still filling in the result for every duplicate word. Example:
my_column   result
string1     Yes
string2     No
string3     Yes
string2     No
string1     Yes
string4     No
This is obtained by looking up the keywords from my_column one by one, without skipping duplicates. Is there a way to scrape only the first occurrence of each keyword, but still fill the result column for every row that contains that keyword?
This is my code:
for keyword in final_list:
    for index, row in data_splitted2.iterrows():
        if keyword == row['my_column']:
            if keyword is None:
                break
            # print(keyword)
            link = website + 'search/q?name=' + keyword
            driver.get(link)
            time.sleep(5)
            try:
                status = driver.find_element_by_class_name("yyyyy")
                # write through the DataFrame: the row from iterrows() may be a copy
                data_splitted2.at[index, 'result'] = status.text
            except NoSuchElementException:
                pass
One last note: my final df must keep the duplicate keywords, so they should be skipped when scraping but still be present in the final df.
Many thanks in advance
Upvotes: 2
Views: 151
Reputation: 24930
If I understand you correctly, you are probably looking for something like the below. It's very simplified, just to respond to the particular issue:
Let's assume you have this dataframe:
import random
import pandas as pd

data = ['string1', 'string2', 'string3', 'string2', 'string1', 'string4']
result = [''] * len(data)  # empty placeholders, one per row
df = pd.DataFrame({'my_column': data, 'result': result})
We can skip the duplicates in performing operations, but assign the results of these operations to all rows, including the duplicates:
for val in df.my_column.unique():
    state = "Yes" if random.randint(1, 2) == 1 else "No"
    # in your actual code, the above line will probably have to be replaced with status.text
    df.loc[df['my_column'] == val, 'result'] = state
df
Random output:
  my_column result
0   string1     No
1   string2    Yes
2   string3    Yes
3   string2    Yes
4   string1     No
5   string4     No
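Applied to the scraping case, the same idea could look roughly like the sketch below. Note that `fetch_status` is a hypothetical stand-in for the Selenium lookup in the question (it is not part of the original code) and returns made-up values so that the example runs without a browser. The sketch does one lookup per distinct keyword, stores the results in a dict, and then uses `Series.map` to fill every row, duplicates included.

```python
import pandas as pd


def fetch_status(keyword):
    """Hypothetical stand-in for the Selenium lookup (driver.get + find element).

    Returns a made-up status so the sketch runs without a browser.
    """
    fetch_status.calls += 1  # count lookups to show that duplicates are skipped
    return "Yes" if keyword in {"string1", "string3"} else "No"


fetch_status.calls = 0

df = pd.DataFrame(
    {"my_column": ["string1", "string2", "string3", "string2", "string1", "string4"]}
)

# One lookup per distinct keyword; dropna() leaves out missing keywords.
cache = {kw: fetch_status(kw) for kw in df["my_column"].dropna().unique()}

# map() looks each row's keyword up in the dict, so duplicate rows share one result.
df["result"] = df["my_column"].map(cache)

print(df)
print("lookups:", fetch_status.calls)
```

In the real code, the body of `fetch_status` would presumably hold the `driver.get(...)` call and the `try`/`except NoSuchElementException` block, returning `status.text` or a default value when the element is not found.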
Upvotes: 1