Amatya

Reputation: 1243

Beautiful Soup, conditionally extracting Href

From a given set of web pages, I am trying to extract links from a table, but only when the "document type" is a specific value. For example, on this website, I only want to get the href if the document type is "Technical Assistance Reports".


When I inspect the page in the browser, the table row has a middle "td" that holds the document type.

But when I use BeautifulSoup, I can find the href, but I am unable to find the text that says "Technical Assistance Reports".

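import requests
from bs4 import BeautifulSoup

url2 = "https://www.adb.org/projects/54128-001/main#project-documents"
response = requests.get(url2)
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(response.text, "html.parser")
# print(soup.prettify())
rows = soup.find_all("tr")
# Cells of the second row (index 0 is the first "tr" on the page)
rows[1].find_all("td")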

I get this:

[<td>
 <a href="/projects/documents/reg-54128-001-tar">Implementing the Cities Development Initiative for Asia: Technical Assistance Report</a> </td>,
 <td class="width-2-12 views-field views-field-field-date-content">
 <span class="date-display-single" content="2020-09-30T00:00:00+08:00" datatype="xsd:dateTime" property="">Sep 2020</span> </td>]

The href is there and the date is there, but I can't find the text "Technical Assistance Reports". The middle "td" isn't showing up in the parsed response.

This example only has one document listed on the web page, but other examples may have many or none. Ideally, I would like to be able to loop through all the "tr" elements and only pick up the href if the document type is "Technical Assistance Reports" or another type that I am looking for. What am I doing wrong here, and what's a good way to accomplish this?

Upvotes: 0

Views: 81

Answers (1)

Dr Pi

Reputation: 417

You could fetch the sitemap and pick out just the "tar" entries from it, i.e. the URLs like the /projects/documents/reg-54128-001-tar link in your output.

https://www.adb.org/sitemap.xml?page=1
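A rough sketch of that idea, in case it helps. It assumes the sitemap follows the standard sitemaps.org XML format and that Technical Assistance Report URLs end in "-tar", which is only a guess based on the single href shown in the question. The function name filter_sitemap_urls and the sample XML are made up for illustration; for the real thing you would pass in the body fetched from the sitemap URL above, e.g. requests.get(...).content.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemaps.org format (assumed to apply here).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def filter_sitemap_urls(sitemap_xml, suffix="-tar"):
    """Return the <loc> URLs in a sitemap whose path ends with `suffix`.

    `suffix="-tar"` is an assumption inferred from one example href,
    not a documented naming rule of the site.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        # Ignore a trailing slash before comparing the ending.
        if url.rstrip("/").endswith(suffix):
            urls.append(url)
    return urls


# Hypothetical two-entry sitemap, used only to show the filtering.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.adb.org/projects/documents/reg-54128-001-tar</loc></url>
  <url><loc>https://www.adb.org/projects/54128-001/main</loc></url>
</urlset>"""

tar_urls = filter_sitemap_urls(SAMPLE)
print(tar_urls)
```

This avoids parsing the project page at all, so it does not matter that the document type cell is missing from the HTML that requests returns. You may still want to check each match against the project numbers you care about.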


Upvotes: 2
