Reputation: 25
I am completing a Master's in Data Science and am working on a text mining assignment. For this project I need to download several PDFs from a website. In this case, I want to scrape and save the document called "Prospectus".
Below is the Python code I am using. The prospectus I want to download is shown in the screenshot below. However, the script downloads other documents from the web page instead. What do I need to change in my script?
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://www.ishares.com/us/products/239726/ishares-core-sp-500-etf"
# If there is no such folder, the script will create one automatically
folder_location = r'.\Output'
if not os.path.exists(folder_location): os.mkdir(folder_location)
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name each PDF file after the last portion of its link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
Upvotes: 1
Views: 1917
Reputation: 195408
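The Prospectus link does not appear to be in the initial HTML. The page source contains a dataAjaxUrl value, and the link can be read from the response to that URL. Try: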
import re
import requests
import urllib.parse
from bs4 import BeautifulSoup
url = "https://www.ishares.com/us/products/239726/ishares-core-sp-500-etf"
html = requests.get(url).text
ajax_url = (
"https://www.ishares.com"
+ re.search(r'dataAjaxUrl = "([^"]+)"', html).group(1)
+ "?action=ajax"
)
soup = BeautifulSoup(requests.get(ajax_url).content, "html.parser")
prospectus_url = (
"https://www.ishares.com"
+ soup.select_one("a:-soup-contains(Prospectus)")["href"]
)
pdf_url = (
"https://www.ishares.com"
+ urllib.parse.parse_qs(prospectus_url)["iframeUrlOverride"][0]
)
print("Downloading", pdf_url)
with open(pdf_url.split("/")[-1], "wb") as f_out:
    f_out.write(requests.get(pdf_url).content)
Prints:
Downloading https://www.ishares.com/us/literature/prospectus/p-ishares-core-s-and-p-500-etf-3-31.pdf
and saves p-ishares-core-s-and-p-500-etf-3-31.pdf:
-rw-r--r-- 1 root root 325016 okt 17 22:31 p-ishares-core-s-and-p-500-etf-3-31.pdf
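If the layout of the link changes, the query-string step is the part most likely to break, because the script above passes the whole URL to parse_qs rather than only its query part. That works as long as iframeUrlOverride is not the first parameter. Below is a minimal sketch of a more defensive variant, which splits off the query with urlsplit first. The helper name pdf_url_from_viewer_link and the sample href are hypothetical and only for illustration; the real href on the site may look different.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

BASE_URL = "https://www.ishares.com"


def pdf_url_from_viewer_link(href, base_url=BASE_URL):
    """Return the absolute PDF URL named by the iframeUrlOverride parameter.

    Hypothetical helper: it assumes the link carries the PDF path in a
    query parameter called iframeUrlOverride, as in the script above.
    """
    # Parse only the query part, so the parameter is found even when it
    # is the first one after the "?".
    params = parse_qs(urlsplit(href).query)
    values = params.get("iframeUrlOverride")
    if not values:
        raise ValueError(f"no iframeUrlOverride parameter in {href!r}")
    # urljoin handles both relative and absolute PDF paths.
    return urljoin(base_url, values[0])


# Hypothetical href, made up for illustration only.
sample_href = (
    "/us/library/stream-document?iframeUrlOverride="
    "%2Fus%2Fliterature%2Fprospectus%2Fp-ishares-core-s-and-p-500-etf-3-31.pdf"
)
print(pdf_url_from_viewer_link(sample_href))
```

In the script above, this would replace the pdf_url assignment, taking the href of the selected link as its argument.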
Upvotes: 1