Downloading PDFs using Python web scraping not working

Here is my code:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://mathsmadeeasy.co.uk/gcse-maths-revision/"

#If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
os.makedirs(folder_location, exist_ok=True)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    #Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location,link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,link['href'])).content)

Any help as to why the code does not download any of the files from the maths revision site would be appreciated. Thanks.

Upvotes: 2

Views: 270

Answers (1)

CyberFoxar

Reputation: 569

Although the page looks static, it isn't. The content you are trying to access is loaded by JavaScript after the initial page load. To check this, I saved the HTML that requests actually received (the same text BS4 parses) and opened it in a text editor:

with open(os.path.join(folder_location, "page.html"), 'wb') as f:
    f.write(response.content)

By the look of it, the page replaces placeholders using JS, as hinted by the comment on line 70 of the saved HTML file: // interpolate json by replacing placeholders with variables
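
A quicker way to run the same check, without reading the saved file by hand, is to count the PDF anchors in the raw HTML. The sketch below uses only the standard library's html.parser instead of BS4, and it compares the extension case-insensitively (unlike the a[href$='.pdf'] selector). The names PdfLinkCollector and find_pdf_links, the example.com URLs, and both sample strings are illustrative assumptions, not taken from the actual page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collects absolute URLs of <a> tags whose href ends in .pdf (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return every PDF link present in the HTML exactly as the server sent it."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Made-up markup: one page with real anchors, one that only has a JS placeholder
static_page = '<a href="/files/paper1.pdf">Paper 1</a><a href="/about">About</a>'
js_page = '<div id="app">{{ placeholder }}</div><script>/* fills in links later */</script>'

print(find_pdf_links(static_page, "https://example.com/"))
print(find_pdf_links(js_page, "https://example.com/"))
```

If find_pdf_links(response.text, url) comes back empty for the real page, the loop in the question never runs, which would explain why nothing is downloaded.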

As for a solution: BS4 only parses the HTML it is given and cannot execute JavaScript, so the PDF links never appear in the soup. I suggest looking at this answer from someone who had a similar problem. I also suggest looking into Scrapy if you intend to do more complex web scraping.
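
Before reaching for a browser-driven tool, it may be worth checking whether the PDF URLs already sit somewhere in the raw response as JSON or script data, since the page appears to interpolate JSON into placeholders. The following is only a sketch under that assumption, which the saved HTML may or may not bear out. The function extract_pdf_urls, the pattern, and the sample text are hypothetical.

```python
import re

# Matches absolute http(s) URLs ending in .pdf; tolerates JSON-escaped slashes (\/)
PDF_URL_PATTERN = re.compile(r'https?:(?:\\?/){2}[^\s"\'<>]+?\.pdf', re.IGNORECASE)


def extract_pdf_urls(raw_text):
    """Return unique PDF URLs found anywhere in the text, in order of appearance."""
    seen = []
    for match in PDF_URL_PATTERN.findall(raw_text):
        url = match.replace("\\/", "/")  # undo JSON escaping of slashes
        if url not in seen:
            seen.append(url)
    return seen


# Made-up example of a URL embedded in inline script data
sample = r'var data = {"file": "https:\/\/example.com\/papers\/algebra.pdf"};'
print(extract_pdf_urls(sample))
```

Calling extract_pdf_urls(response.text) on the real page would show whether this shortcut applies. If it returns nothing, the links are probably fetched by a separate request, and a tool that actually runs the JavaScript is the more likely route.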

Upvotes: 3
