Data_Science_Mick

Reputation: 25

Downloading PDFs from a Website using Python

I am completing a Master's in Data Science and working on a Text Mining assignment. For this project, I intend to download several PDFs from a website. In this case, I want to scrape and save the document called "Prospectus".

Below is the Python code I am using. The prospectus I wish to download is shown in the screenshot below. However, the script downloads other documents from the web page instead. Is there something I need to change in my script?

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.ishares.com/us/products/239726/ishares-core-sp-500-etf"

# If there is no such folder, the script will create one automatically
folder_location = r'.\Output'
if not os.path.exists(folder_location): os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)

[Screenshot: Prospectus]

Upvotes: 1

Views: 1917

Answers (1)

Andrej Kesely

Reputation: 195408

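The Prospectus link comes from a separate Ajax URL that the page references as dataAjaxUrl, and the PDF path sits in the link's iframeUrlOverride query parameter rather than at the end of the href. Try: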

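import re
import requests
from urllib.parse import parse_qs, urlsplit
from bs4 import BeautifulSoup

url = "https://www.ishares.com/us/products/239726/ishares-core-sp-500-etf"
html = requests.get(url).text

# The page embeds the path of its Ajax endpoint in a script variable
ajax_url = (
    "https://www.ishares.com"
    + re.search(r'dataAjaxUrl = "([^"]+)"', html).group(1)
    + "?action=ajax"
)

# Take the first link in the Ajax response whose text contains "Prospectus"
soup = BeautifulSoup(requests.get(ajax_url).content, "html.parser")
prospectus_url = (
    "https://www.ishares.com"
    + soup.select_one("a:-soup-contains(Prospectus)")["href"]
)

# The PDF path is carried in the iframeUrlOverride query parameter.
# Parse only the query string so the lookup works wherever the parameter appears.
pdf_url = (
    "https://www.ishares.com"
    + parse_qs(urlsplit(prospectus_url).query)["iframeUrlOverride"][0]
)

# Save the PDF under the last path segment of its URL
print("Downloading", pdf_url)
with open(pdf_url.split("/")[-1], "wb") as f_out:
    f_out.write(requests.get(pdf_url).content)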

Prints:

Downloading https://www.ishares.com/us/literature/prospectus/p-ishares-core-s-and-p-500-etf-3-31.pdf

and saves p-ishares-core-s-and-p-500-etf-3-31.pdf:

-rw-r--r-- 1 root root 325016 okt 17 22:31 p-ishares-core-s-and-p-500-etf-3-31.pdf
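
If you want to check the two parsing steps without hitting the site, here is a minimal offline sketch. The sample strings and the helper names (`find_ajax_path`, `pdf_url_from_viewer_href`) are made up for illustration and are not the real page content; the actual markup may differ. It assumes the link carries the PDF path in an `iframeUrlOverride` query parameter, as in the script above, and raises a clear error if the page layout has changed.

```python
import re
from urllib.parse import parse_qs, urljoin, urlsplit

BASE = "https://www.ishares.com"

# Hypothetical stand-ins for the page HTML and for a link from the Ajax response.
sample_page = '<script>var dataAjaxUrl = "/us/products/239726/fund/example.ajax";</script>'
sample_href = (
    "/us/library/stream-document?stream=reg&iframeUrlOverride="
    "%2Fus%2Fliterature%2Fprospectus%2Fp-ishares-core-s-and-p-500-etf-3-31.pdf"
)


def find_ajax_path(html):
    """Return the path assigned to dataAjaxUrl, or raise if it is missing."""
    match = re.search(r'dataAjaxUrl = "([^"]+)"', html)
    if match is None:
        raise ValueError("dataAjaxUrl not found; the page layout may have changed")
    return match.group(1)


def pdf_url_from_viewer_href(href, base=BASE):
    """Build the absolute PDF URL from a link's iframeUrlOverride parameter."""
    # Resolve the href against the site root, then parse only its query string.
    query = urlsplit(urljoin(base, href)).query
    values = parse_qs(query).get("iframeUrlOverride")
    if not values:
        raise ValueError("no iframeUrlOverride parameter in link")
    # parse_qs has already percent-decoded the value.
    return urljoin(base, values[0])


print(find_ajax_path(sample_page))
print(pdf_url_from_viewer_href(sample_href))
```

Parsing only the query string matters when `iframeUrlOverride` happens to be the first parameter: calling `parse_qs` on the whole URL would then fold the scheme, host and path into the key name and the lookup would fail.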

Upvotes: 1
