Reputation: 59
I have a dataframe called wbppmorocco with a number of websites; it looks like this:
URL
11 http://projects.worldbank.org/en/projects-operations/project-detail/P124639
4 http://projects.worldbank.org/en/projects-operations/project-detail/P130891
13 http://projects.worldbank.org/en/projects-operations/project-detail/P133312
14 http://projects.worldbank.org/en/projects-operations/project-detail/P133312
3 http://projects.worldbank.org/en/projects-operations/project-detail/P146970
12 http://projects.worldbank.org/en/projects-operations/project-detail/P147760
15 http://projects.worldbank.org/en/projects-operations/project-detail/P150520
7 http://projects.worldbank.org/en/projects-operations/project-detail/P151072
8 http://projects.worldbank.org/en/projects-operations/project-detail/P151072
10 http://projects.worldbank.org/en/projects-operations/project-detail/P151072
5 http://projects.worldbank.org/en/projects-operations/project-detail/P155522
16 http://projects.worldbank.org/en/projects-operations/project-detail/P155522
19 http://projects.worldbank.org/en/projects-operations/project-detail/P160661
6 http://projects.worldbank.org/en/projects-operations/project-detail/P162637
18 http://projects.worldbank.org/en/projects-operations/project-detail/P165228
17 http://projects.worldbank.org/en/projects-operations/project-detail/P167788
I would like to open each website and extract the "abstract" text. However, BeautifulSoup doesn't see the abstract because the page renders its content with JavaScript, so I had to use Selenium. My problem is that Selenium raises an error saying the element couldn't be located. My end goal is to keep only the URLs whose abstract contains some specific keywords. My code for opening the links and retrieving the text is as follows:
import time
from selenium import webdriver

driver = webdriver.Chrome()  # or whichever driver you use

description = []
# Filtering dataframe for only relevant opportunities
for link in wbppmorocco.iterrows():
    url = link[1]['URL']
    print(url)
    driver.get(url)
    time.sleep(20)
    # This is the line that fails with "element couldn't be located":
    desc = driver.find_element_by_class_name('more _loop_lead_paragraph_sm')
    print(desc.text)
    description.append(desc)
Any tips would be greatly appreciated!
Upvotes: 0
Views: 94
Reputation: 23825
The worldbank website that you use makes an HTTP (API) call in the background to get the data for you.
You can call that API directly yourself.
Here is a working example:
import requests

ids = ['P124639', 'P130891']  # dummy ids -- replace with your own project ids
for _id in ids:
    r = requests.post(f'https://search.worldbank.org/api/v2/projects?format=json&fl=*&id={_id}&apilang=en')
    if r.status_code == 200:
        print(r.json())
    else:
        print(r.status_code)
So all you have to do is to loop over the project ids you have and read the data.
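To connect this back to the original goal of keeping only matching projects, here is a minimal sketch. The project ID sits at the end of each URL, so you can extract it with a regex instead of scraping. The response shape and where the abstract lives inside the JSON are assumptions on my part -- inspect `r.json()` once to confirm before relying on it; below I crudely search the whole JSON payload as text.

```python
import re
import requests

def project_id(url):
    """Extract the trailing project ID (e.g. 'P124639') from a detail URL."""
    match = re.search(r'(P\d+)$', url)
    return match.group(1) if match else None

def contains_keyword(text, keywords):
    """True if any keyword appears in the text (case-insensitive)."""
    text = text.lower()
    return any(kw.lower() in text for kw in keywords)

def fetch_project(_id):
    """Fetch one project's JSON via the API call shown above; None on failure."""
    r = requests.post(
        f'https://search.worldbank.org/api/v2/projects?format=json&fl=*&id={_id}&apilang=en'
    )
    return r.json() if r.status_code == 200 else None

# Usage (requires network access):
# ids = {project_id(u) for u in wbppmorocco['URL']}   # dedupes repeated URLs
# keywords = ['water', 'irrigation']                  # your own keywords
# kept = [i for i in ids
#         if (p := fetch_project(i)) is not None
#         and contains_keyword(str(p), keywords)]     # crude: search whole JSON
```

Deduplicating the IDs first (the set comprehension) also avoids fetching the same project twice, which your dataframe would otherwise cause for repeated URLs like P151072.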
Upvotes: 1