Reputation: 1243
From a given selection of webpages, I am trying to extract links from a table, conditional on the "document type" information being something specific. For example, on this website, I only want to get the href if the document type is "Technical Assistance Reports".
When I inspect the page in my browser, the document type shows up right there in the table. But when I use BeautifulSoup, I can find the href, yet I cannot find the text that says "Technical Assistance Reports".
import requests
from bs4 import BeautifulSoup

url2 = "https://www.adb.org/projects/54128-001/main#project-documents"
response = requests.get(url2)
soup = BeautifulSoup(response.text, "html.parser")
# print(soup.prettify())
parent = soup.find_all('tr')
parent[1].find_all('td')
I get this:
[<td>
<a href="/projects/documents/reg-54128-001-tar">Implementing the Cities Development Initiative for Asia: Technical Assistance Report</a> </td>,
<td class="width-2-12 views-field views-field-field-date-content">
<span class="date-display-single" content="2020-09-30T00:00:00+08:00" datatype="xsd:dateTime" property="">Sep 2020</span> </td>]
The href is there and the date is there, but I can't find the text "Technical Assistance Reports". The middle "td" isn't showing up.
This example only has the one document listed on the webpage, but other examples may have many or none. Ideally, I would like to be able to loop through all the "tr" elements and only pick up the href if the document type is "Technical Assistance Report" or one of the other types I am looking for. What am I doing wrong here, and what's a good way to accomplish this?
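Roughly what I have in mind is something like this (just a sketch, assuming the document-type text actually appears somewhere in the HTML that requests gets back; WANTED_TYPES is a placeholder name I made up):

import requests
from bs4 import BeautifulSoup

# document types I care about (placeholder set, not from the page)
WANTED_TYPES = {"Technical Assistance Report", "Technical Assistance Reports"}

url = "https://www.adb.org/projects/54128-001/main#project-documents"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

matching_hrefs = []
for row in soup.find_all("tr"):
    # join the text of every cell so the document type is caught
    # no matter which column it lands in
    row_text = " ".join(td.get_text(strip=True) for td in row.find_all("td"))
    if any(doc_type in row_text for doc_type in WANTED_TYPES):
        for a in row.find_all("a", href=True):
            matching_hrefs.append(a["href"])

print(matching_hrefs)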
Upvotes: 0
Views: 81
Reputation: 417
You could get the sitemap and glob just the technical assistance report ("-tar") URLs from that.
https://www.adb.org/sitemap.xml?page=1
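A rough sketch of what I mean (it assumes the report URLs keep the "-tar" suffix seen in your href, and that lxml is installed so BeautifulSoup can use its xml parser):

import requests
from fnmatch import fnmatch
from bs4 import BeautifulSoup

# fetch one page of the sitemap and keep only the "-tar" URLs
sitemap = requests.get("https://www.adb.org/sitemap.xml?page=1")
soup = BeautifulSoup(sitemap.text, "xml")  # the "xml" parser needs lxml

tar_urls = [loc.get_text(strip=True)
            for loc in soup.find_all("loc")
            if fnmatch(loc.get_text(strip=True), "*-tar")]

print(len(tar_urls))
print(tar_urls[:5])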
Upvotes: 2