Fatima El Mansouri

Reputation: 59

Web scraping multiple websites with links saved in a dataframe: unable to locate element error

I have a dataframe called wbppmorocco containing a number of website URLs; it looks like this:

    URL
11  http://projects.worldbank.org/en/projects-operations/project-detail/P124639
4   http://projects.worldbank.org/en/projects-operations/project-detail/P130891
13  http://projects.worldbank.org/en/projects-operations/project-detail/P133312
14  http://projects.worldbank.org/en/projects-operations/project-detail/P133312
3   http://projects.worldbank.org/en/projects-operations/project-detail/P146970
12  http://projects.worldbank.org/en/projects-operations/project-detail/P147760
15  http://projects.worldbank.org/en/projects-operations/project-detail/P150520
7   http://projects.worldbank.org/en/projects-operations/project-detail/P151072
8   http://projects.worldbank.org/en/projects-operations/project-detail/P151072
10  http://projects.worldbank.org/en/projects-operations/project-detail/P151072
5   http://projects.worldbank.org/en/projects-operations/project-detail/P155522
16  http://projects.worldbank.org/en/projects-operations/project-detail/P155522
19  http://projects.worldbank.org/en/projects-operations/project-detail/P160661
6   http://projects.worldbank.org/en/projects-operations/project-detail/P162637
18  http://projects.worldbank.org/en/projects-operations/project-detail/P165228
17  http://projects.worldbank.org/en/projects-operations/project-detail/P167788

I would like to open each website and extract the "abstract" text. However, BeautifulSoup doesn't see the abstract in the page source (the page is rendered with JavaScript), so I had to use Selenium instead. My problem is that Selenium raises an error saying the element couldn't be located. My end goal is to keep only the URLs whose abstracts contain certain keywords (see the filtering sketch after the code below). My code for opening the links and retrieving the text is as follows:

import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on PATH

description = []
# Open each project page and collect its abstract text
for _, row in wbppmorocco.iterrows():
    url = row['URL']
    print(url)
    driver.get(url)
    time.sleep(20)  # wait for the JavaScript-rendered content to load
    # This is the line that fails with the "unable to locate element" error
    desc = driver.find_element_by_class_name('more _loop_lead_paragraph_sm')
    print(desc.text)
    description.append(desc.text)  # store the text, not the WebElement
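
For reference, the filtering step I plan to run afterwards looks roughly like this (a minimal sketch: keywords is a placeholder list, and it assumes description ends up holding one abstract string per row of wbppmorocco):

keywords = ['health', 'education']  # placeholder keywords

# Keep only the rows whose abstract mentions at least one keyword
wbppmorocco['description'] = description
mask = wbppmorocco['description'].str.contains('|'.join(keywords), case=False, na=False)
wbppmorocco = wbppmorocco[mask]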

Any tips would be greatly appreciated!

Upvotes: 0

Views: 94

Answers (1)

balderman

Reputation: 23825

The worldbank website that you use makes an HTTP (API) call in order to get the data for you.

You can do that as well.

Here is a functional example:

import requests

ids = [12, 45, 67]  # dummy ids
for _id in ids:
    # Query the projects API directly for one project id
    r = requests.post(f'https://search.worldbank.org/api/v2/projects?format=json&fl=*&id={_id}&apilang=en')
    if r.status_code == 200:
        print(r.json())
    else:
        print(r.status_code)

So all you have to do is loop over the project ids you have and read the data.
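
For instance, here is a minimal sketch that pulls the ids out of your dataframe's URLs and keeps only the projects that mention one of your keywords (keywords is a placeholder list, and the crude search over the whole JSON payload is just a stand-in until you inspect r.json() and pick out the abstract field yourself):

import requests

keywords = ['health', 'education']  # placeholder keywords

# The project id is the last path segment of each URL
project_ids = wbppmorocco['URL'].str.split('/').str[-1].unique()

relevant_ids = []
for _id in project_ids:
    r = requests.post(f'https://search.worldbank.org/api/v2/projects?format=json&fl=*&id={_id}&apilang=en')
    if r.status_code != 200:
        print(_id, r.status_code)
        continue
    # Crude keyword check over the whole payload; narrow it to the
    # abstract field once you know the exact response structure
    text = str(r.json()).lower()
    if any(kw in text for kw in keywords):
        relevant_ids.append(_id)

print(relevant_ids)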

Upvotes: 1
