Reputation: 61
Using Python, I'd like to download all PDF files (except those whose names begin with "INS") from this website:
url_asn="https://www.asn.fr/recherche?filter_year[from]={}&filter_year[to]={}&limit=50&search_content_type=&search_text={}&sort_type=date&page={}"
If link['href'] is not a PDF, then open it and download any PDF files it contains. Do this for each page, iterating through to the last page.
Upvotes: 0
Views: 1224
Reputation: 480
This will probably work; I have added comments for every line.
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = " " # url to scrape
# If there is no such folder, the script will create one automatically
folder_location = r'/webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)  # get the html
soup = BeautifulSoup(response.text, "html.parser")  # parse the html

for link in soup.select("a[href$='.pdf']"):  # select all the pdf links
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    # open the file and write the pdf into it
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
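The snippet above grabs the PDFs on a single page, but the question also asks to skip files whose names begin with "INS" and to walk every result page. Below is a minimal sketch of how that could look, built on the same approach and the url_asn template from the question. The save_pdfs helper is illustrative, the example year range and empty search text are assumptions, and so is the stop condition (a page with no PDF links means we are past the last page); adjust all of these for the actual site.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url_asn = ("https://www.asn.fr/recherche?filter_year[from]={}&filter_year[to]={}"
           "&limit=50&search_content_type=&search_text={}&sort_type=date&page={}")

folder_location = r'/webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

def save_pdfs(page_soup, base_url):
    # illustrative helper: download every PDF linked from page_soup,
    # skipping names that begin with "INS"; returns whether any PDF link was found
    found = False
    for link in page_soup.select("a[href$='.pdf']"):
        found = True
        name = link['href'].split('/')[-1]
        if name.startswith("INS"):  # skip files whose names begin with "INS"
            continue
        with open(os.path.join(folder_location, name), 'wb') as f:
            f.write(requests.get(urljoin(base_url, link['href'])).content)
    return found

page = 1  # assumption: paging starts at 1
while True:
    page_url = url_asn.format(2020, 2021, "", page)  # assumed example parameters
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    if not save_pdfs(soup, page_url):
        break  # assumption: a page with no PDF links means we've passed the last page
    page += 1

If the search results link to detail pages rather than directly to PDFs, each result link can be fetched and parsed the same way, with save_pdfs applied to the detail page's soup as well.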
Upvotes: 2