Reputation: 141
I am trying to download several PDFs which are located behind different hyperlinks on a single URL. My approach was first to retrieve the URLs containing the "fileEntryId" text, which point to the PDFs, following this link, and then to try to download the PDF files using the approach from this link.
This is "my" code so far:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
import re
import os
import requests
from urllib.parse import urljoin

http = httplib2.Http()
status, response = http.request('https://www.contraloria.gov.co/resultados/proceso-auditor/auditorias-liberadas/regalias/auditorias-regalias-liberadas-ano-2015')

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a', href=re.compile('.*fileEntryId.*'))):
    if link.has_attr('href'):
        x = link['href']

# If there is no such folder, the script will create one automatically
folder_location = r'c:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(x)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("x"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
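As an aside, the first step (filtering anchors whose `href` contains "fileEntryId" with a `SoupStrainer`) can be checked offline against a small HTML snippet. The markup below is a made-up stand-in; the real page's structure may differ:

```python
import re
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup standing in for the real results page
html = """
<a href="https://example.com/documents?fileEntryId=123">Report A</a>
<a href="https://example.com/other">Not a report</a>
<a href="https://example.com/documents?fileEntryId=456">Report B</a>
"""

# Parse only anchors whose href mentions "fileEntryId"
only_report_links = SoupStrainer('a', href=re.compile('fileEntryId'))
soup = BeautifulSoup(html, 'html.parser', parse_only=only_report_links)

links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)  # both fileEntryId links, the third anchor is skipped
```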
Thank you
Upvotes: 0
Views: 255
Reputation: 154
Create a folder anywhere and put the script in it. When you run the script, the downloaded PDF files should appear in that folder. If for some reason the script doesn't work for you, make sure your bs4 version is up to date, as I've used pseudo CSS selectors to target the required links.
import requests
from bs4 import BeautifulSoup

link = 'https://www.contraloria.gov.co/resultados/proceso-auditor/auditorias-liberadas/regalias/auditorias-regalias-liberadas-ano-2015'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    # Grab every report link in the results table that carries a "fileEntryId"
    for item in soup.select("table.table > tbody.table-data td.first > a[href*='fileEntryId']"):
        inner_link = item.get("href")
        resp = s.get(inner_link)
        soup = BeautifulSoup(resp.text, "lxml")
        # The "Descargar" (Download) anchor on the detail page points at the actual PDF
        pdf_link = soup.select_one("a.taglib-icon:contains('Descargar')").get("href")
        file_name = pdf_link.split("/")[-1].split("?")[0]
        with open(f"{file_name}.pdf", "wb") as f:
            f.write(s.get(pdf_link).content)
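The file-name derivation inside the loop (last path segment, query string stripped) can be verified in isolation. The URL below is a made-up example of the shape such a download link typically has:

```python
# Hypothetical direct-download URL; the real ones come from the "Descargar" anchor
pdf_link = "https://www.contraloria.gov.co/documents/1234/Informe_Regalias_2015?version=1.0"

# Take the last path segment and drop the query string, as in the loop above
file_name = pdf_link.split("/")[-1].split("?")[0]
print(f"{file_name}.pdf")  # Informe_Regalias_2015.pdf
```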
Upvotes: 1