Reputation: 15
Andrej kindly helped me write the code below. How can I now navigate to each of those pages and download every PDF that has "Public Comment" in its link text or title?
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.ci.atherton.ca.us/Archive.aspx?AMID=41"
    key = "Archive.aspx?ADID="

    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    for link in soup.find_all("a"):
        if key in link.get("href", ""):
            print("https://www.ci.atherton.ca.us/" + link.get("href"))
Prints:
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3581
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3570
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3564
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3559
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3556
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3554
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3552
Upvotes: 0
Views: 498
Reputation: 195408
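You can collect the archive page links first, then visit each page and download every link whose text contains "public comment" (case-insensitive). Try: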
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.ci.atherton.ca.us/Archive.aspx?AMID=41"
    key = "Archive.aspx?ADID="

    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    all_links = []
    for link in soup.find_all("a"):
        if key in link.get("href", ""):
            all_links.append("https://www.ci.atherton.ca.us/" + link.get("href"))

    for link in all_links:
        print("Checking {}...".format(link))
        soup = BeautifulSoup(requests.get(link).content, "html.parser")
        for a in soup.find_all(
            lambda tag: tag.name == "a" and "public comment" in tag.text.lower()
        ):
            pdf_link = "https://www.ci.atherton.ca.us" + a["href"]
            filename = a["href"].split("/")[-1] + ".pdf"
            print("Downloading {} to {}".format(pdf_link, filename))
            with open(filename, "wb") as f_out:
                f_out.write(requests.get(pdf_link).content)
Prints:
...
Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3514...
Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3505...
Downloading https://www.ci.atherton.ca.us/DocumentCenter/View/8628/Public-Comments-1202021---ITEM-No-15 to Public-Comments-1202021---ITEM-No-15.pdf
Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3498...
Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3479...
Downloading https://www.ci.atherton.ca.us/DocumentCenter/View/8516/Wayne-Lee---Public-Comments_12162020 to Wayne-Lee---Public-Comments_12162020.pdf
Downloading https://www.ci.atherton.ca.us/DocumentCenter/View/8532/Discher-Stephanie_Public-Comments_12162020 to Discher-Stephanie_Public-Comments_12162020.pdf
...
Each matching PDF is saved to the current working directory under the filename shown.
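If you want something a bit more defensive than plain string concatenation, here is a minimal sketch using only the standard library. The helper names (`absolute_url`, `pdf_filename`, `looks_like_pdf`) are my own and purely illustrative. It assumes the hrefs on the site are either root-relative (like `/DocumentCenter/View/...`) or relative to the site root, which is what the output above suggests but is not guaranteed for every page.

```python
import posixpath
from urllib.parse import unquote, urljoin, urlsplit

# Assumption: every link on the archive pages should be resolved against the site root.
BASE_URL = "https://www.ci.atherton.ca.us/"


def absolute_url(href, base=BASE_URL):
    """Resolve a relative or root-relative href against the base URL."""
    return urljoin(base, href)


def pdf_filename(href):
    """Derive a local filename from the last path segment of the href."""
    path = urlsplit(href).path               # drops any ?query or #fragment
    name = unquote(posixpath.basename(path.rstrip("/")))
    if not name.lower().endswith(".pdf"):
        name += ".pdf"                       # avoids a doubled ".pdf.pdf"
    return name


def looks_like_pdf(content):
    """Cheap sanity check: PDF files start with the bytes '%PDF-'."""
    return content.startswith(b"%PDF-")
```

In the download loop you could then write `pdf_link = absolute_url(a["href"])` and `filename = pdf_filename(a["href"])`, and only write the file when `looks_like_pdf(response.content)` is true, so an HTML error page does not end up saved with a `.pdf` extension.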
Upvotes: 1