sebasmps

Reputation: 11

Download PDF files without a .pdf URL

I am trying to download PDF files from this website.

I am new to Python and am currently learning the language. I have downloaded packages such as urllib and bs4. However, there is no .pdf extension in any of the URLs. Instead, each one has the following format: http://www.smv.gob.pe/ConsultasP8/documento.aspx?vidDoc={.....}.

I have tried using soup.find_all, but it was not successful:

from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib

url="http://www.smv.gob.pe/frm_hechosdeImportanciaDia?data=38C2EC33FA106691BB5B5039DACFDF50795D8EC3AF"
response = request.urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")    
links = soup.find_all('a', href=re.compile(r'(http://www.smv.gob.pe/ConsultasP8/documento.aspx?)'))
print(links)

Upvotes: 1

Views: 703

Answers (1)

Sergio Pulgarin

Reputation: 929

This works for me:

import re

import requests
from bs4 import BeautifulSoup

url = "http://www.smv.gob.pe/frm_hechosdeImportanciaDia?data=38C2EC33FA106691BB5B5039DACFDF50795D8EC3AF"
# Fetch the page and parse the HTML
response = requests.get(url).content
soup = BeautifulSoup(response, "html.parser")
# Match only anchors whose href points at the documento.aspx handler
links = soup.find_all('a', href=re.compile(r'(http://www.smv.gob.pe/ConsultasP8/documento.aspx?)'))
# Keep just the href attribute of each matched Tag
links = [l['href'] for l in links]
print(links)

The only difference is that I use requests because I'm used to it, and I take the href attribute from each Tag returned by BeautifulSoup.
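If the goal is to actually save the PDFs, here is a minimal sketch of how you might follow up. It assumes each matched href is an absolute URL that returns the PDF bytes directly, and it uses the vidDoc query value as a filename purely for illustration:

import re

import requests
from bs4 import BeautifulSoup

url = "http://www.smv.gob.pe/frm_hechosdeImportanciaDia?data=38C2EC33FA106691BB5B5039DACFDF50795D8EC3AF"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
links = [a['href'] for a in
         soup.find_all('a', href=re.compile(r'ConsultasP8/documento\.aspx\?'))]

for link in links:
    # Hypothetical filename: take whatever follows vidDoc= in the query string.
    vid = link.split("vidDoc=")[-1]
    # Even though the URL has no .pdf extension, the response body should be
    # the PDF itself, so we write it to disk as-is.
    pdf = requests.get(link)
    with open(f"{vid}.pdf", "wb") as f:
        f.write(pdf.content)

If the hrefs turn out to be relative, you would need to join them against the site root (for example with urllib.parse.urljoin) before requesting them.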

Upvotes: 1
