Reputation: 3490
I wrote a function to find all .pdf files on a web page and download them. It works well when the link is publicly accessible, but when I use it for a course website (which can only be accessed on my university's network), the downloaded PDFs are corrupted and cannot be opened.
How can I fix it?
import urllib2
import wget
from bs4 import BeautifulSoup

def get_pdfs(my_url):
    html = urllib2.urlopen(my_url).read()
    html_page = BeautifulSoup(html, 'html.parser')
    links = []
    # collect the href of every anchor that points to a PDF
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            links.append(my_url + current_link)
    print(links)
    for link in links:
        #urlretrieve(link)
        wget.download(link)

get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')
When I use this grader link, current_link is something like /courses/320241/2019_2/lectures/lecture_7_8.pdf, but the /courses/320241/2019_2/ part is already included in my_url, so appending it produces a broken URL. However, the function works perfectly for the other course page (https://cnds.jacobs-university.de/courses/os-2019/).
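To illustrate, the concatenation repeats the path segment:

my_url = 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2/'
current_link = '/courses/320241/2019_2/lectures/lecture_7_8.pdf'
# the course path appears twice, so the result is not a valid file URL
print(my_url + current_link)
# https://grader.eecs.jacobs-university.de/courses/320241/2019_2//courses/320241/2019_2/lectures/lecture_7_8.pdf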
Is there a way I can use the same function to work with both types of links?
Upvotes: 1
Views: 1879
Reputation: 24940
OK, I think I understand the issue now. Try the code below on your data. I think it works, but obviously I couldn't try it directly on the page requiring login. Also, I changed your structure and variable definitions a bit, because I find it easier to think that way, but if it works, you can easily modify it to suit your own tastes.
Anyway, here goes:
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urlparse

my_urls = ['https://cnds.jacobs-university.de/courses/os-2019/',
           'https://grader.eecs.jacobs-university.de/courses/320241/2019_2']
links = []
for url in my_urls:
    resp = requests.get(url)
    soup = bs(resp.text, 'lxml')
    # some pages advertise their canonical base URL in an og:url meta tag
    og = soup.find("meta", property="og:url")
    base = urlparse(url)
    for link in soup.find_all('a'):
        current_link = link.get('href')
        # guard against anchors that have no href attribute
        if current_link and current_link.endswith('pdf'):
            if og:
                # relative href: prepend the canonical base URL
                links.append(og["content"] + current_link)
            else:
                # absolute-path href: prepend only scheme and host
                links.append(base.scheme + "://" + base.netloc + current_link)

for link in links:
    print(link)
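If you want to actually save the files rather than just print the links, you could extend the final loop. Here's a minimal sketch using requests, with the local file name taken from the last segment of the URL path. Note that for the login-protected grader page you would also need to pass the appropriate session cookies or credentials to requests.get; otherwise you download a login/error page instead of the PDF, which would also explain the "corrupted" files you saw.

import os

for link in links:
    # use the last path segment of the URL as the local file name
    filename = os.path.basename(urlparse(link).path)
    resp = requests.get(link)
    with open(filename, 'wb') as f:
        f.write(resp.content)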
Upvotes: 1