lsignori

Reputation: 15

Beautiful Soup to Download PDFs from Multiple Pages

Andrej kindly helped me write this code, which lists the links to each archive page. How can I now navigate to each of those pages and download every PDF that has "Public Comment" in its link text or title?

import requests
from bs4 import BeautifulSoup

url = "https://www.ci.atherton.ca.us/Archive.aspx?AMID=41"
key = "Archive.aspx?ADID="

soup = BeautifulSoup(requests.get(url).content, "html.parser")

for link in soup.find_all("a"):
    if key in link.get("href", ""):
        print("https://www.ci.atherton.ca.us/" + link.get("href"))

Prints:

https://www.ci.atherton.ca.us/Archive.aspx?ADID=3581
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3570
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3564
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3559
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3556
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3554
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3552

Upvotes: 0

Views: 498

Answers (1)

Andrej Kesely

Reputation: 195408

Try:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.ci.atherton.ca.us/Archive.aspx?AMID=41"
key = "Archive.aspx?ADID="

soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Collect the absolute URL of every archive page linked from the index.
all_links = []
for link in soup.find_all("a"):
    if key in link.get("href", ""):
        all_links.append(urljoin(url, link["href"]))

for link in all_links:
    print("Checking {}...".format(link))
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
    # Match anchors that have an href and whose text mentions "public comment".
    for a in soup.find_all(
        lambda tag: tag.name == "a"
        and tag.has_attr("href")
        and "public comment" in tag.text.lower()
    ):
        pdf_link = urljoin(link, a["href"])
        # The last path segment has no extension, so append ".pdf".
        filename = a["href"].split("/")[-1] + ".pdf"
        print("Downloading {} to {}".format(pdf_link, filename))
        with open(filename, "wb") as f_out:
            f_out.write(requests.get(pdf_link).content)

Prints:

...

Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3514...
Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3505...
Downloading https://www.ci.atherton.ca.us/DocumentCenter/View/8628/Public-Comments-1202021---ITEM-No-15 to Public-Comments-1202021---ITEM-No-15.pdf
Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3498...
Checking https://www.ci.atherton.ca.us/Archive.aspx?ADID=3479...
Downloading https://www.ci.atherton.ca.us/DocumentCenter/View/8516/Wayne-Lee---Public-Comments_12162020 to Wayne-Lee---Public-Comments_12162020.pdf
Downloading https://www.ci.atherton.ca.us/DocumentCenter/View/8532/Discher-Stephanie_Public-Comments_12162020 to Discher-Stephanie_Public-Comments_12162020.pdf

...

It also saves each matching PDF to a file in the current working directory.
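If you want to check the URL and filename handling without hitting the site, that step can be pulled into a small helper. This is only a sketch: `build_download_target` and `BASE_URL` are hypothetical names I made up for illustration (they are not part of requests or Beautiful Soup), and it assumes the document links are either site-relative or already absolute.

```python
import posixpath
from urllib.parse import urljoin, urlparse

# Assumption: all links are resolved against the site root.
BASE_URL = "https://www.ci.atherton.ca.us/"


def build_download_target(href, base_url=BASE_URL):
    """Return (absolute_url, filename) for a document link (hypothetical helper)."""
    # urljoin handles relative, site-relative, and absolute hrefs.
    absolute_url = urljoin(base_url, href)
    # Use the last path segment, ignoring any query string or fragment.
    name = posixpath.basename(urlparse(absolute_url).path.rstrip("/"))
    # Only append ".pdf" when the segment does not already end with it.
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    return absolute_url, name


pdf_link, filename = build_download_target(
    "/DocumentCenter/View/8628/Public-Comments-1202021---ITEM-No-15"
)
print("Downloading {} to {}".format(pdf_link, filename))
```

Inside the download loop you could then call `build_download_target(a["href"])` in place of the two lines that build `pdf_link` and `filename`.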

Upvotes: 1
