Aureon

Reputation: 141

Download pdfs with python

I am trying to download several PDFs which are located at different hyperlinks within a single URL. My approach was first to retrieve the URLs containing the "fileEntryId" text, which point to the PDFs, following this link, and secondly to download the PDF files using the approach from this link.

This is "my" code so far:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer
import re
import os
import requests
from urllib.parse import urljoin


http = httplib2.Http()
status, response = http.request('https://www.contraloria.gov.co/resultados/proceso-auditor/auditorias-liberadas/regalias/auditorias-regalias-liberadas-ano-2015')

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a', href=re.compile('.*fileEntryId.*'))):
    if link.has_attr('href'):
        x=link['href']
        
        #If there is no such folder, the script will create one automatically
        folder_location = r'c:\webscraping'
        if not os.path.exists(folder_location):os.mkdir(folder_location)

        response = requests.get(x)
        soup= BeautifulSoup(response.text, "html.parser")     
        for link in soup.select("x"):
            #Name the pdf files using the last portion of each link which are unique in this case
            filename = os.path.join(folder_location,link['href'].split('/')[-1])
            with open(filename, 'wb') as f:
                f.write(requests.get(urljoin(url,link['href'])).content)
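As an aside, the folder-creation step in the snippet above can be written a bit more robustly with `os.makedirs`, which is a no-op when the folder exists and also creates missing parent directories — a minimal sketch:

```python
import os

folder_location = r"c:\webscraping"
# exist_ok=True makes this a no-op if the folder already exists,
# and makedirs also creates any missing parent directories.
os.makedirs(folder_location, exist_ok=True)
```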

Thank you

Upvotes: 0

Views: 255

Answers (1)

MITHU

Reputation: 154

Create a folder anywhere and put the script in it. When you run the script, the downloaded PDF files should appear in that folder. If for some reason the script doesn't work for you, make sure your bs4 version is up to date, as I've used pseudo-CSS selectors to target the required links.

import requests
from bs4 import BeautifulSoup

link = 'https://www.contraloria.gov.co/resultados/proceso-auditor/auditorias-liberadas/regalias/auditorias-regalias-liberadas-ano-2015'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("table.table > tbody.table-data td.first > a[href*='fileEntryId']"):
        inner_link = item.get("href")
        resp = s.get(inner_link)
        soup = BeautifulSoup(resp.text,"lxml")
        # ":-soup-contains" replaces the deprecated ":contains" in newer soupsieve releases
        pdf_link = soup.select_one("a.taglib-icon:-soup-contains('Descargar')").get("href")
        file_name = pdf_link.split("/")[-1].split("?")[0]
        with open(f"{file_name}.pdf","wb") as f:
            f.write(s.get(pdf_link).content)
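The filename logic above (last path segment, query string stripped) can be factored into a small helper if you want to reuse it or save into a target folder; `derive_name` is a hypothetical name for illustration, not part of the answer:

```python
import os

def derive_name(pdf_link, folder="."):
    # Mirror pdf_link.split("/")[-1].split("?")[0] from the answer:
    # keep the last path segment and drop any query string.
    name = pdf_link.split("/")[-1].split("?")[0]
    return os.path.join(folder, f"{name}.pdf")
```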

Upvotes: 1
