Beautiful Soup, conditionally extracting Href

Question

From a given set of webpages, I am trying to extract links from a table, but only when the "document type" column has a specific value. For example, on this website, I only want to get the href if the document type is "Technical Assistance Reports".

When I inspect the page in Google Chrome, I see this:

But when I use BeautifulSoup, I can find the href, but I am unable to find the text that says "Technical Assistance Reports".

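import requests
from bs4 import BeautifulSoup

url2 = "https://www.adb.org/projects/54128-001/main#project-documents"
response = requests.get(url2)
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(response.text, "html.parser")
#print(soup.prettify())
parent = soup.find_all('tr')  # every table row on the page
parent[1].find_all('td')      # cells of the second row (index 1)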

I get this:

[
 Implementing the Cities Development Initiative for Asia: Technical Assistance Report ,
 
 Sep 2020 ]

The href is there and the date is there, but I can't find the text "Technical Assistance Reports". The content of the middle "td" isn't showing up.

This example has only one document listed on the webpage, but other examples may have many or none. Ideally, I would like to be able to loop through all the "tr" elements and only pick up the href if the document type is "Technical Assistance Report" or another type that I am looking for. What am I doing wrong here, and what's a good way to accomplish this?

Dr Pi · Accepted Answer

You could get the sitemap and pick out just the "tar" (Technical Assistance Report) entries from that.

https://www.adb.org/sitemap.xml?page=1
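
The sitemap idea can be sketched roughly as follows. This is only an illustration, not a tested scraper for that site: it assumes the sitemap page is a standard sitemaps.org `urlset` document and that report URLs end in "-tar", which is a guess about the URL pattern that should be checked against the real sitemap. The names `filter_sitemap_urls` and `fetch_and_filter` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def filter_sitemap_urls(xml_data, suffix="-tar"):
    """Return every <loc> URL in a sitemap whose text ends with `suffix`.

    `xml_data` may be str or bytes. The "-tar" default is an assumption
    about how report URLs are named, not something confirmed by the site.
    """
    root = ET.fromstring(xml_data)
    urls = [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if url.endswith(suffix)]


def fetch_and_filter(sitemap_url, suffix="-tar"):
    """Download one sitemap page and filter it (hypothetical helper)."""
    import requests  # imported here so the parser above has no extra dependency

    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()
    # Pass bytes so the XML parser handles the declared encoding itself.
    return filter_sitemap_urls(response.content, suffix)
```

Something like `fetch_and_filter("https://www.adb.org/sitemap.xml?page=1")` would then return the matching links from the first sitemap page; the other pages would need the same call with a different `page` value.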
